ABSTRACT

We investigate applying general-purpose join algorithms to
the triangle listing problem in an out-of-core context. In
particular, we focus on Leapfrog Triejoin (LFTJ) by
Veldhuizen [36], a recently proposed, worst-case optimal
algorithm. We present "boxing": a novel, yet conceptually
simple, approach for feeding input data to LFTJ. Our
extensive analysis shows that this approach is I/O efficient,
being worst-case optimal (in a certain sense). Furthermore, if
input data is only a constant factor larger than the
available memory, then a boxed LFTJ essentially maintains the
CPU data-complexity of the vanilla LFTJ. Next, focusing on
LFTJ applied to the triangle query, we show that for many
graphs boxed LFTJ matches the I/O complexity of MGT [10], the
specialized algorithm recently proposed by Hu, Tao and Yufei
for listing triangles in an out-of-core setting. We
also strengthen the analysis of LFTJ's computational
complexity for the triangle query by considering families of
input graphs that are characterized not only by the number of
edges but also by a measure of their density. E.g., we show
that LFTJ achieves a CPU complexity of O(|E| log |E|) for
planar graphs, while on general graphs, no algorithm can be
faster than O(|E|^{1.5}). Finally, we perform an experimental
evaluation for the triangle listing problem, confirming our
theoretical results and showing the overall effectiveness of
our approach. On all our real-world and synthetic data sets
(some of which contain more than 1.2 billion edges) LFTJ
in single-threaded mode is within a factor of 3 of the
specialized MGT, a penalty that, as we demonstrate, can be
alleviated by parallelization.


Categories and Subject Descriptors 

H.2.4 [Systems]: Query processing. Parallel databases 
★ The following is “work-in-progress”, yet has some promising
results I am happy to share. The results I am most proud of
are: (1) The performance achieved in absolute numbers on large
datasets. It is very exciting that these are achieved not by a
specialized triangle-listing algorithm but by a general-purpose
join algorithm. (2) The theoretical results about the in-memory
LFTJ and its optimality for input instance classes characterized
by their arboricity (Thms. 17 & 19). (3) The simplicity of the
proposed out-of-core technique and the elegance of subsetting
TrieArrays. While the achieved I/O complexity for the
out-of-core technique can likely be improved, the experiments
show that this is not the bottleneck.


1. INTRODUCTION 

Hu, Tao, and Yufei [10] recently proposed a novel algorithm
(MGT) for listing triangles in large graphs that is both I/O
and CPU efficient and outperforms existing competitors by an
order of magnitude. At the same time, there has been exciting
theoretical research showing that it is possible to design
so-called worst-case optimal join algorithms [36, 20]. This
raises the question: How would general-purpose join algorithms
compare to the best specialized triangle-listing algorithms in
a setting where not all data fits into main memory?

This question is motivated by the desire to build
general-purpose systems that empower their (domain) users to
pose and run queries in a declarative and general language
such as SQL or Datalog, a goal that is likely uncontroversial.
We focus on the out-of-core setting not only for the obvious
reason that input or intermediary data may not fit in main
memory, but also because we would like to utilize graphics
processing units (GPUs) as high-throughput co-processors
during query evaluation. GPU memory is currently limited
to around 12GB, highlighting the urgency for robust
out-of-core techniques.

Triangle listing is a basic building block for many other
graph algorithms and a key ingredient for graph metrics and
analyses such as triangular clustering and finding cohesive
subgraphs [10, 30, 31, 25]. In addition, it has received
extensive attention in the research literature across several
fields: graph theory, databases, and network analysis, to name
a few. Both in-memory and out-of-core algorithms have been
studied. A general-purpose technique that can compete with the
best-in-class hand-crafted algorithms specific to triangle
listing would indeed be very good news for the database
community, which advocates high-level, declarative query
languages.

We selected Leapfrog-Triejoin (LFTJ) by Veldhuizen as
the general-purpose join algorithm for our study, for several
reasons: (1) its elegance allows efficient implementations
with various optimizations, (2) by nature, LFTJ only uses
O(1) intermediary data, making it a very good candidate in
the out-of-core context, and (3) it has strong theoretical
worst-case guarantees. LFTJ's worst-case guarantee in its
generality is technical. Roughly, it guarantees that for a
given query and input I, LFTJ will never perform
asymptotically more steps (up to a log-factor) than are
strictly necessary for any correct algorithm on inputs I'
that are similar to I. Here, similar means that, e.g., neither
the sizes of the input relations nor certain other statistics
of the data can change.



Model & Assumptions. We restrict our attention to
full-conjunctive queries, and use a Datalog syntax and
terminology to describe queries (or joins). Our formal setting
is the standard one for considering I/O efficient algorithms:
Input, intermediary and output data can exceed the amount
of available main memory M (measured in words to store
one atomic value), in which case it can be read (written)
from (to) secondary storage with the granularity of a block
that has size B. Reading or writing a block incurs 1 unit of
I/O cost. For I/O and CPU cost, we consider data complexities,
that is, we assume the query to be fixed and small. In
particular, we require M/B to be larger than, say, 10 times
the number of atoms multiplied by their maximum arity.
Furthermore, to simplify complexity results, we assume that
|I|/B is larger than log |I|. This restriction is mostly
theoretical: Using a block size of 64KiB with a 64-bit
word-width, inputs only need to be larger than 15MiB to
satisfy the requirement. With these assumptions in mind, we
make the following contributions:

Contributions 

Boxing LFTJ. We present and analyze a novel strategy
we call boxing for out-of-core execution of a multi-predicate,
worst-case optimal join algorithm (Leapfrog-Triejoin). This
method exhibits the following properties:

(1) For queries with n variables, executing on input data
I and producing output data of size K, boxed LFTJ requires
at most O(|I|^n/(M^{n-1} B) + K/B) block I/Os. We show that
this bound is worst-case optimal, in the sense that for any
n, we can construct a query such that no algorithm can have
an asymptotically better bound with respect to |I| and K.

(2) We further show that if the input data exhibits limited
skew (in a sense we will make precise) then boxed LFTJ
requires only O(|I|^r/(M^{r-1} B) + K/B) I/Os. Here, r denotes
the rank of the query, a property we will define. The rank
of a query never exceeds the number of variables used in the
Datalog body, and is often lower.

(3) We also analyze the computational complexity of boxed
LFTJ. Here, we show that if the input size |I| is only a
constant factor larger than the available memory M, then
the asymptotic CPU work performed by boxed LFTJ
(essentially)* matches the asymptotic complexity of the
in-memory LFTJ, maintaining its theoretical guarantees.

* Except when the in-memory LFTJ's complexity is in o(|I|),
in which case the boxed version's complexity is O(|I|).

Boxed LFTJ-Δ. We apply boxing to the triangle-listing
problem. Here, the input graph exhibits limited skew if the
degree of its nodes is limited by M/9. With 100GiB of main
memory this allows graphs containing nodes with up to 1.3
million neighbors.

On such graphs, our approach requires O(|I|^2/(MB) + K/B)
block I/Os, matching the asymptotic I/O bound of the recently
presented specialized algorithm MGT [10] for triangle listing.

In-memory LFTJ-Δ. We also tighten the analysis of
the CPU complexity of the conventional in-memory LFTJ
applied to the triangle listing query, using non-trivial
arguments. It is easy to see that LFTJ-Δ's asymptotic
complexity of O(|E|^{1.5} log |E|) is worst-case optimal modulo
the log-factor. We improve on this result in two ways:

(1) We show that for graphs G = (V, E) with arboricity
a(|E|), LFTJ-Δ requires O(|E| a(|E|) log |E|) work. A
graph's arboricity is a measure of its density (as we will
explain later) which never exceeds O(√|E|). Moreover, a is
substantially smaller for many graphs; for example,
for both planar graphs and graphs with fixed maximum
degree, a ∈ O(1). As a corollary, we thus obtain that LFTJ-Δ
runs in O(|E| log |E|) on planar graphs.

(2) We further improve on the worst-case optimality
analysis: We show that even if we are only interested in
families of graphs whose arboricity is limited by any
function â ∈ o(√|E|), e.g., by 30 log |E|, and we would like to
design a specialized algorithm that (only) works (well) on
these graphs, then this algorithm cannot have an asymptotic
complexity that is in o(â(|E|)|E|). This result shows
that LFTJ-Δ is worst-case optimal for any of these families
(modulo the log-factor).


Evaluation. We further present an experimental evaluation,
where we focus on the triangle query. We confirm
that the boxing technique works well, especially when the
input data is only a constant factor larger than the
available memory: on real-world and synthetic graphs, each
with more than 1.2 billion nodes, boxing introduces only
little CPU overhead and performs well even when only
limited main memory is available. We also compare the
raw performance against two competitors: a specialized
C++-based implementation in the graph-processing system
Graphlab [32] and the specialized triangle listing algorithm
MGT [10]. LFTJ is about 65% [...] the Graphlab implementation,
yet scales to larger data sizes. When running
single-threaded, LFTJ is on average 3x slower than MGT. Our
parallelized version of LFTJ, however, is slightly faster than
the single-threaded MGT (about 30% [...] main memory is
restricted to as much as 10% [...]

The rest of the paper is structured as follows: Section 2
reviews the relevant background information. We present and
analyze the boxing strategy for LFTJ in Section 3. The
subsequent sections analyze the in-memory and the boxed
variant of LFTJ applied to the triangle query, highlight
some important aspects of our implementation, experimentally
evaluate our approaches, review related work, and conclude.


2. BACKGROUND 

2.1 Review: Leapfrog-Triejoin (LFTJ) [36]

LFTJ [36] is a multi-predicate join algorithm. Unlike
traditional binary join algorithms such as Hash-Join or
Sort-Merge-Join, which take two relations as input, LFTJ takes
as input n relations together with the join conditions.

Example 1 (LFTJ) Consider the query:

Q(x, y, z) ← R(x, y), S(x, z), T(y, z).

With binary joins, we first join, e.g., R(x, y) with S(x, z)
to obtain RS(x, y, z) and then join RS(x, y, z) with T(y, z)
to obtain Q(x, y, z). LFTJ, on the other hand, directly
computes Q(x, y, z) given all predicates in the body of the
rule as input, without storing any sizeable intermediary
results.

Some notation is necessary: for a binary relation R(x, y),
let R(x, _) denote the set of values in the first column, i.e.,
R projected to its first attribute; further, let R_a(y) denote
the projection to the second attribute after selecting only
the tuples that have the constant a as the first attribute.
Figure 1: Trie and Trie Array of a ternary relation

LFTJ operates by first fixing an order of the variables
occurring in the rule body. In our example, we might pick
x, y, z as the order. Then, LFTJ finds all possible values a
for the variable x. This is done by intersecting R(x, _) and
S(x, _), i.e., the first column of R with the first column of
S, because the variable x occurs in these atoms. As soon as
the first such a is found, LFTJ looks for values b of y, the
next variable in the variable order. Here, LFTJ computes the
intersection of R_a(y) and T(y, _). Again, as soon as the
first such b is found, LFTJ looks for values c for z by
computing the intersection of S_a(z) and T_b(z). For each
such c found, LFTJ reports the tuple (a, b, c) in the output.
Once the intersection of S_a(z) and T_b(z) has been computed,
LFTJ backtracks its search to the variable y and looks for
the next b'. Backtracking continues up to the first variable,
and LFTJ finishes when no new a' can be found. A key to LFTJ's
performance is to compute the various intersections
efficiently. This is achieved via the method of a Leapfrog join
(LFJ) which, as we detail below, leverages that relations are
pre-sorted.


Trie representation for relations. It is convenient to
think of relations as being represented as a Trie*. A Trie is
a tree that stores each tuple of a relation as a path from the
root node to a leaf node. See Fig. 1(a) for an example of
a ternary relation with its trie in Fig. 1(b). In general, a
Trie for a relation with arity a has a height of a. For a
relation R(x_1, ..., x_a), the nodes at height i store values
from the ith column of R. We require that children of the same
node n are unique and ordered increasingly. For example, in
Fig. 1(b) at level 2, the children of b are the values u, v,
and w, which are in increasing order.

* Also called prefix tree, radix tree or digital tree.

TrieIterators. LFTJ accesses relational data not directly
but via a TrieIterator interface. This not only allows
various storage schemes† but also facilitates uniform handling
of “infinite” predicates such as Equal, SmallerThan or Plus.
The TrieIterator interface provides methods to navigate the
Trie of a relation. It can be thought of as a pointer to a
node in the Trie. The detailed methods for Trie navigation
are given in Apx. A.1. The methods are value() to
access a data value, and open() and close() to move up and
down in the trie. The linear iterator methods next, seek,
and atEnd are used to move within unary “sub-relations” A
such as R(x, _) or R_a(y). Here, next moves one step to the
right and seek(v) forward-positions the iterator to the
element with value v; if v is not in A, then the iterator is
placed at the element with the smallest value w > v. In
general, if the iterator passes the end of the represented
relation such as R_a(y), atEnd will return true. A key to
good LFTJ performance is that back-end data structures
efficiently support these TrieIterator operations. In fact, the
theoretical guarantees given by LFTJ require that value,
key, and atEnd have complexity O(1). Furthermore, seek and
next must not take longer than O(log N) individually and
must have an amortized cost of at most O(1 + log(N/m)) if
m keys are visited. Here, N stands for the size of the unary
relation the iterator is for, e.g., R_a(y).

† E.g., regular B+-Trees, sorted lists of tuples, or the
TrieArrays we describe later.

Leapfrog Join. A basic building block of Leapfrog Triejoin
(LFTJ) is the Leapfrog join (LFJ). It computes the
intersection of multiple unary relations. For this, LFJ has a
linear iterator for each of its input relations. An execution
of LFJ is reminiscent of the merge phase of a merge-sort;
however, instead of returning values that are in any of the
inputs, we search for and return values that are in all input
relations. To do so efficiently, we use seek to iteratively
advance the iterator positioned at the smallest value
to the largest value amongst the iterators. If all iterators
are placed on the same value, we have found a value of the
intersection.

Using LFJ to join n relations, with N_min and N_max
denoting the cardinalities of the smallest and largest
relation, respectively, has the following complexity:

Proposition 2 (3.1 in [36]) The running time of Leapfrog
join is O(N_min log(N_max/N_min)).

The detailed algorithms for the Leapfrog join as well as
LFTJ are given in the appendix as reference; for an even more
detailed introduction and reference see [36].
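The leapfrogging idea can be illustrated with a minimal Python
sketch. It is not the implementation from [36] or from this
paper: it assumes the unary relations are given as in-memory
sorted lists without duplicates, and it realizes seek with a
plain binary search (bisect_left) rather than an iterator with
the amortized guarantee discussed above. The name leapfrog_join
is chosen for illustration only.

```python
from bisect import bisect_left


def leapfrog_join(relations):
    """Intersect sorted, duplicate-free lists by leapfrogging with seek."""
    if not relations or any(len(r) == 0 for r in relations):
        return []
    k = len(relations)
    pos = [0] * k  # one linear iterator (a position) per input relation
    # Visit the iterators round-robin, starting with the smallest value.
    order = sorted(range(k), key=lambda i: relations[i][0])
    hi = relations[order[-1]][0]  # largest value any iterator points at
    out = []
    p = 0
    while True:
        i = order[p]
        rel = relations[i]
        if rel[pos[i]] == hi:
            # The smallest value equals the largest: all iterators agree.
            out.append(hi)
            pos[i] += 1  # next()
        else:
            # seek(hi): least position whose value is >= hi
            pos[i] = bisect_left(rel, hi, pos[i])
        if pos[i] == len(rel):  # atEnd()
            return out
        hi = rel[pos[i]]  # this iterator now holds the largest value
        p = (p + 1) % k


print(leapfrog_join([[1, 3, 5, 7], [3, 4, 5, 6, 7]]))
```

Each loop iteration moves only the iterator that currently
holds the smallest value, so the loop ends as soon as any one
input is exhausted.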

Leapfrog TrieJoin Restrictions. LFTJ requires that no
variable occurs more than once in a single body atom. This
can be achieved via simple rewrites: Given a join with, e.g.,
the atom A = R(x, y, x) in the body, we introduce a new
variable x' and replace A by R(x, y, x'), Eq(x, x'), where Eq
is the infinite equal-relation, which itself is represented by
a specialized TrieIterator.

As mentioned above, LFTJ is parameterized by an order
on the variables of the join. This order is usually chosen
by an optimizer, as the exact order might influence run-time
characteristics and can have an effect on the theoretical
bounds for the I/O complexity, as we will detail below.
Furthermore, the chosen order determines the sort order of
the input relations: In particular, arguments in atoms of
the join body must form a subsequence of the chosen order.
E.g., consider the order x, y, z: body atoms R(x, z)
or S(y) are allowed, while the atom T(y, x) needs to be
replaced by an alternative index T_{2,1}(x, y), which is
created as T_{2,1}(x, y) ← T(y, x). These indexes are created
in a pre-processing step.


2.2 TrieArrays 

We use a simple array encoding for Tries, which is inspired
by the Compressed-Sparse-Row (CSR) format, a commonly
used format to store graphs. As an example, see Fig. 1(c) for
the representation of the trie given in Fig. 1(b). The data
values are stored in flat arrays called value arrays. Index
arrays are used to separate children at the same tree level but
from different parent nodes. An n-ary relation has n value
arrays and n − 1 index arrays. In particular, the children
of a node n stored in the value array val_i at position j are
stored in the array val_{i+1}, starting at the index idx_i[j]
until the index idx_i[j + 1] exclusively. E.g., in Fig. 1(c)
the children w, x, y of w from val_1[3] are stored in val_2
from idx_1[3] = 4 to idx_1[4] = 7.
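To make the encoding concrete, the following Python sketch
builds the value and index arrays from lexicographically
sorted tuples. It is an illustration under assumptions, not the
paper's implementation: the names build_triearray and children
are hypothetical, levels are numbered from 0, each index array
carries one closing entry so that idx[level][j + 1] is always
defined, and the arrays are built in a single pass over
growable Python lists.

```python
def build_triearray(tuples, arity):
    """Encode sorted, duplicate-free tuples as value and index arrays."""
    val = [[] for _ in range(arity)]      # one value array per level
    idx = [[] for _ in range(arity - 1)]  # one index array per inner level
    prev = None
    for t in tuples:
        # d = first column in which t differs from the previous tuple
        d = 0
        if prev is not None:
            while d < arity and t[d] == prev[d]:
                d += 1
        for level in range(d, arity):
            if level < arity - 1:
                # children of the new node start at the current end
                # of the next level's value array
                idx[level].append(len(val[level + 1]))
            val[level].append(t[level])
        prev = t
    # closing entry: end position of the last node's children
    for level in range(arity - 1):
        idx[level].append(len(val[level + 1]))
    return val, idx


def children(val, idx, level, j):
    """Children of the node stored at val[level][j]."""
    return val[level + 1][idx[level][j]:idx[level][j + 1]]


val, idx = build_triearray([(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (4, 7)], 2)
print(val, idx, children(val, idx, 0, 2))
```

For a binary edge relation this is exactly the CSR layout:
val[0] holds the source nodes, idx[0] the row offsets, and
val[1] the concatenated adjacency lists.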

To reduce notation, we will often simply identify a relation
R with its TrieArray representation and vice versa in the rest
of the paper. For example, when we write an n-ary TrieArray
we mean a TrieArray for an n-ary relation R.

All TrieIterator operations are trivial to implement for
TrieArrays, except possibly seek, where some attention needs
to be given to achieve the required amortized complexity.
Here, instead of starting the binary search in the middle of
the remaining sub-array, we probe with an exponentially
increasing lookup sequence of, e.g., 1, 4, 16, 64, ... to
narrow the lower and upper bounds for the binary search.
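A minimal Python sketch of such a seek on a single sorted
value array is given below. It is only one possible reading of
the description: the function name and signature are
assumptions, the step sizes 1, 4, 16, ... are applied relative
to the last probed position, and the final search uses
bisect_left on the narrowed range.

```python
from bisect import bisect_left


def seek(arr, start, v):
    """Least position p >= start with arr[p] >= v, or len(arr) if none."""
    n = len(arr)
    if start >= n or arr[start] >= v:
        return start
    lo = start  # invariant: arr[lo] < v
    step = 1
    while lo + step < n and arr[lo + step] < v:
        lo += step
        step *= 4  # probe distances 1, 4, 16, 64, ...
    hi = min(lo + step, n)  # arr[hi] >= v, or hi == n
    # binary search only inside the narrowed range (lo, hi]
    return bisect_left(arr, v, lo + 1, hi)


evens = list(range(0, 200, 2))
print(seek(evens, 0, 101))
```

When the target is close to the current position, only a few
probes are needed, which is what keeps a sequence of short
forward seeks cheap compared to restarting a full binary search
each time.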

While the TrieArray representation is beneficial for
execution, it is also fairly cheap to create:

Proposition 3 The TrieArray representation of a relation
R requires no more than O(|R|) space and can be built in
O(SORT(R)) time and I/O complexity.

Sketch. The space requirement is obvious; furthermore, the
data structure can be built from a lexicographically sorted
R in two passes: pass 1 determines the sizes of the value and
index arrays, pass 2 fills in data. □

Algorithm 1 Steps Performed by Leapfrog-Triejoin on the
Triangles Query T(x, y, z) ← E(x, y), E(x, z), E(y, z).

1: for a ∈ V_1 ∩ V_1 do        ▷ V_1 := {x | (x, v) ∈ E}
2:   for b ∈ V_1 ∩ D(a) do     ▷ D(v) := {x | (v, x) ∈ E}
3:     for c ∈ D(a) ∩ D(b) do
4:       yield (a, b, c)       ▷ triangle found

Figure 2(d): TrieArray T for E = E(G*)

2.3 LFTJ for Computing Triangles 

Let G be a simple, undirected graph and let G* = (V, E)
be its directed version; that is, for each edge {a, b} ∈ E(G),
E contains the pair (min{a, b}, max{a, b}). The query

T(x, y, z) ← E(x, y), E(x, z), E(y, z), x < y < z.

computes all triangles in G* formed by edges (a, b), (a, c),
and (b, c). The output T coincides with the triangles in G.
Since x < y < z is already implied by the atoms containing the
edge relation, we can omit the inequality from the query,
obtaining:

T(x, y, z) ← E(x, y), E(x, z), E(y, z).    (Δ)

Figure 2: Example for out-of-core technique for
LFTJ-Δ, i.e., T(x, y, z) ← E(x, y), E(x, z), E(y, z) on E(G*).
Panels: (e) Boxed Search Space; (f) Example Box & TrieSlices.

3. BOXING LFTJ 

We first motivate our strategy by showing that LFTJ can
suffer from excessive I/O operations in an external-memory
setting with a block-based least-recently-used memory
replacement strategy. As an example, we use the triangle query
with specifically crafted input graphs.

LFTJ on the triangle query. It is useful to highlight
the steps that LFTJ performs for the triangle query (Δ).
These are summarized in Algorithm 1. Note that Algorithm 1
is not the pseudo-code of the program we use to
list triangles; it only summarizes the steps LFTJ performs
when run on the triangle query. First, the leapfrog join at
level x for the atoms E(x, y) and E(x, z) computes the
intersection between E(x, _) = V_1 and E(x, _) = V_1. Then,
for each value a found for x, we perform a leapfrog join at
level y, computing the intersection of E_a(y) = D(a) with
V_1 = E(y, _), because the variable y occurs in the atoms
E(x, y) and E(y, z). In the last step, we find bindings for z
by intersecting D(a) = E_a(z) with D(b) = E_b(z), because z
occurs in the atoms E(x, z) and E(y, z).
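The three nested intersections can be spelled out in a short
Python sketch. As with Algorithm 1, this is not the program
used for the experiments; it assumes the directed edge set fits
in memory, stores D(v) as sorted Python lists, and substitutes
a simple merge-style intersection for the leapfrog join. The
names intersect_sorted and list_triangles are illustrative.

```python
def intersect_sorted(a, b):
    """Merge-style intersection of two sorted, duplicate-free lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def list_triangles(edges):
    """List triangles (a, b, c) from directed edges (x, y) with x < y."""
    D = {}  # D[v]: sorted out-neighbours of v
    for x, y in sorted(set(edges)):
        D.setdefault(x, []).append(y)
    V1 = sorted(D)  # values occurring in the first column of E
    triangles = []
    for a in V1:                                    # step 1: V1 ∩ V1
        for b in intersect_sorted(V1, D[a]):        # step 2: V1 ∩ D(a)
            for c in intersect_sorted(D[a], D[b]):  # step 3: D(a) ∩ D(b)
                triangles.append((a, b, c))
    return triangles


print(list_triangles([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]))
```

Step 3 is the one that touches D(b) for every edge (a, b),
which is why its access pattern determines the I/O behaviour
discussed next.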

Example inputs causing excessive I/O. For N >
M + B, consider the graph G_N = (V, E) with edges E as:

E = {(x, y) | x = 0, ..., N and y = N − B·(x mod T)}

where T = M/B + 1 is slightly larger than the number of
blocks fitting into main memory at once. See Fig. 12(a) in
the appendix for an example with N = 24, M = 20, B = 4,
and T = 6. The key idea is that we place values in the second
column of E B apart, which will cause LFTJ to perform
an I/O for every tuple in E for step 3 in Algorithm 1;
furthermore, we make sure values in the second column repeat
in groups large enough that loading all blocks in a group will
evict the first block from memory, effectively preventing
the algorithm from reusing the earlier loaded blocks.

Proposition 4 LFTJ-Δ incurs at least 2|E(G_N)| I/Os for
the above defined graph G_N with a TrieArray data
representation and an LRU memory replacement strategy.

Proof. See Apx. B.1. □


3.1 High-Level Idea 

We now describe our out-of-core adaptation for LFTJ.
LFTJ with a variable order x_1, ..., x_n computes the join by
essentially searching over an n-dimensional space in which
each dimension i spans the domain of the variable x_i.
Loosely speaking, the space is searched in lexicographical
order. As the example above demonstrates, this can lead
to excessive I/O costs. Further I/O accesses are caused by
the potentially non-local accesses of the binary searches of
the leapfrog join.

In our approach, we partition the n-dimensional search
space into “hyper-cubes” or boxes such that the data required
for an individual box fits into memory. LFTJ is then run
over each box individually, finding all input data ready in
memory. We strive for the following properties: (i)
Determining box boundaries is efficient, both in CPU and I/O
work. (ii) Loading data that is restricted to a box is
efficient, again both in terms of CPU and I/O work. (iii) The
total amount of data loaded is minimal.

Fig. 2 illustrates this strategy for LFTJ-Δ. The join
uses three variables x, y, z, resulting in a 3-dimensional
search space. If the input graph G, represented via a
TrieArray, does not fit into the available memory, then we
partition the search space into boxes, for example as in
Fig. 2(e). The partitioning is chosen such that the input data
restricted to an individual box fits into memory. LFTJ-Δ is
then run for each box individually, one after another, while
join results are written append-only in a streaming fashion.

We now explain the different aspects in detail.


3.2 TrieArray Slices 

We assume that input data is given on external storage in
a TrieArray representation, with the attribute order
consistent with the chosen key order for LFTJ. This can easily
be achieved via a pre-processing step that costs O(SORT(|I|))
block I/O and CPU steps. When loading data for a single
box into main memory, we directly operate on the TrieArray
representation to subset the data. The remainder of this
subsection shows that this step can be done very efficiently.

In general, applying any selection σ to a TrieArray for a
predicate R to obtain a TrieArray for σ(R) can be done in
O(|R|) CPU work and O(|R|/B) I/Os if σ(t) can be
computed in O(1) time and space for tuples t ∈ R. This is
because TrieArrays can be used to efficiently enumerate the
represented tuples in lexicographical order, and they can
also efficiently be built from lexicographically sorted tuples.

We are interested in certain range-based selections. It
turns out that these can be built even faster, with costs
proportional to the selected size |σ(R)| rather than the total
data set size |R| (modulo a log-factor), or even less
depending on the cost model.


Example 5 (TrieArray Slice) Consider the binary relation
E from Fig. 2(c) and its TrieArray T in Fig. 2(d). We
are interested in the subset S of T that restricts the first
attribute to the interval [3, 5], i.e.,
S = {(x, y) ∈ T | x ∈ [3, 5]}.
We call this a slice of T at level 0 from 3 to 5. A TrieArray
for this slice is shown at the top of Fig. 2(f). To build this
slice, we can simply copy the values in val_0 for the interval
[3, 5], then look up in idx_0 where the corresponding y values
are, and copy these as well. The index values cannot simply
be subset because the positions need to be shifted by
the amount we cut off from val_1 in the front: the first index
in idx_0 of the slice should read 0 instead of 5. However,
instead of changing the values, we add a wrapper to the index
arrays that subtracts the offset (here 5) dynamically during
accesses. Then, all data used in the arrays of the slice
are simply sub-arrays of the original data.
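The offset wrapper can be sketched in Python as follows. The
class and function names (OffsetIndex, slice_level0) and the
small relation are hypothetical and are not the data of Fig. 2;
the arrays were merely chosen so that, as in the example, the
first index entry of the slice is 5 before the offset is
applied. Note that Python list slicing copies data, whereas the
text describes sub-arrays that share storage with the original.

```python
from bisect import bisect_left, bisect_right


class OffsetIndex:
    """Index sub-array whose entries are shifted by an offset on access."""

    def __init__(self, data, offset):
        self.data = data
        self.offset = offset

    def __getitem__(self, j):
        return self.data[j] - self.offset

    def __len__(self):
        return len(self.data)


def slice_level0(val0, idx0, val1, low, high):
    """Slice of a binary TrieArray with first attribute in [low, high]."""
    i = bisect_left(val0, low)     # first position with value >= low
    j = bisect_right(val0, high)   # one past the last value <= high
    start, end = idx0[i], idx0[j]  # range of the children in val1
    return val0[i:j], OffsetIndex(idx0[i:j + 1], start), val1[start:end]


# Hypothetical binary relation in TrieArray form (idx0 has a closing entry).
val0 = [1, 2, 3, 4, 5, 6]
idx0 = [0, 2, 5, 7, 8, 10, 11]
val1 = [2, 3, 3, 4, 5, 4, 6, 5, 6, 7, 7]

s_val0, s_idx0, s_val1 = slice_level0(val0, idx0, val1, 3, 5)
print(s_val0, [s_idx0[j] for j in range(len(s_idx0))], s_val1)
```

Because the index entries are left untouched and only
reinterpreted on access, building the slice needs no pass over
the index data beyond locating its boundaries.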


In general, for an n-ary relation R, we are interested in
creating slices at a level k, 0 ≤ k < n. At level k the values
are restricted to an interval given by a low bound l and a
high bound h; at levels 0, ..., k − 1, the slice contains only
a single element each. Formally:

Definition 6 (Slices) Let R be an n-ary relation, 0 ≤ k <
n an integer, s be a k-ary tuple, and l and h be two domain
values. The Slice S of R at level k for s from l to h (in
symbols S = R^s_{l→h}) is defined as:

S = {(x_0, ..., x_{n−1}) ∈ R | (x_0, ..., x_{k−1}) = s and l ≤ x_k ≤ h}

We often do not mention the level explicitly as it is evident
from the start tuple s; also, if k = 0, we simply say “Slice
for R from l to h”.

We create and store Slices in the TrieArraySlice data
structure, which is a conventional TrieArray, except that the
index arrays can be parameterized with an offset to
perform dynamic index adaptation as explained in the example
above. As with TrieArrays, we identify the Slice (set of
tuples) with the TrieArraySlice data structure and vice versa
in the rest of the paper.

Given a relation R on secondary storage, we can create
slices of R efficiently:

Proposition 7 (Slice provisioning) Let R be an n-ary
relation stored on secondary storage as a TrieArray; let
k, s, l, and h be as in Def. 6. Then, the slice S = R^s_{l→h}
can be loaded into memory with O(log |R| + |S|/B) block I/Os
and O(log |R| + |S|/B) CPU work, if it fits.

Sketch. The provisioning process is as follows: using k
binary searches on the value arrays val_0, ..., val_{k−1}, we
locate the prefix s in R; the slice is empty if the prefix does
not exist. Then, using two more binary searches we locate the
smallest element l' ≥ l and the largest h' ≤ h in val_k of
R. Their positions are the boundaries in val_k and idx_k
for the interval we copy into the slice. For the remaining
n − k value arrays and n − k − 1 index arrays, we iteratively
follow the pointers within the idx arrays and copy the
appropriate ranges. As a last step we adjust the index arrays'
offset parameter: for each j = k, ..., n − 2, we set the offset
parameter of idx_j to −idx_j[0].

We require O(log |R|) I/Os for the binary searches and
O(|S|/B) I/Os for copying the contiguous values from the
arrays with indexes ≥ k. Similarly, the binary searches
require O(log |R|) CPU work; the remaining CPU work
accounts for requesting the copy operations. □

Note that besides the logarithmic component, provisioning
a slice amounts to simply copying large, contiguous arrays
from secondary storage into main memory. On modern
hardware, this can be done using DMA methods without
causing any significant CPU work. Moreover, modern kernels
might simply memory-map the to-be-copied pages and
perform actual copies only when pages are modified.

Probing. As the last building block, we are interested in
provisioning slices that will fill up a certain budgeted amount
of memory. In particular, we specify the prefix tuple s and
lower bound l as before. But instead of providing an upper
bound h, we give a memory budget m in blocks, as shown in
Fig. 3. We are then interested in a maximal upper bound
h ≥ l such that the slice at s from l to h requires no more
than m blocks of memory. Note that for skewed data, it is
possible that the slice requires more than m blocks of
memory even when h = l. Should this case occur, we report it
via the sentinel value SPILL instead of returning an upper
bound h. Not surprisingly, probing is also efficient:











function PROBE(T, s, l, m) returns h
in:  n-ary TrieArray T    ▷ on secondary storage
     k-Tuple s            ▷ start tuple for attributes 0, ..., k − 1
     value l              ▷ lower bound for attribute k
     int m                ▷ memory budget in blocks
out: Maximal h ≥ l such that the slice T^s_{l→h} occupies
     ≤ m blocks of memory, or SPILL if no such h exists.

Figure 3: Interface for Single Slice Probing

1: l ← −∞                 ▷ value at the start of the search space
2: repeat
3:   probe R, S, T from l for upper bounds h_R, h_S, h_T
4:   h ← min(h_R, h_S, h_T)
5:   provision R, S, T from l to h
6:   run LFTJ on the provisioned slices
7:   l ← succ(h)          ▷ lower bound is successor of old upper
8: until ∞ = h            ▷ until we have searched all space

Figure 4: Example: Boxing for R(x), S(x), T(x)

Proposition 8 For a TrieArray T on secondary storage,
probing the upper bound for a Slice to fill up a memory
budget as described in Fig. 3 requires O(log |T|) I/Os and CPU
work.

Sketch. Similar to slice provisioning, except that we do a
binary search for the upper bound and check for each guess
how many blocks the TrieSlice would occupy. This can be
done by following the idx_i pointers. Determining the size
of the TrieSlice for each guess requires at most O(n) I/Os,
where n is the arity of T. Since we binary search in val_k,
an array of size at most |T|, and n is a constant under data
complexity, we obtain the required complexity of
O(log |T|). □
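The probing step can be sketched in Python for the simplest
case. The sketch makes several assumptions that are not taken
from the paper: the TrieArray is binary and held in memory,
the slice is taken at level 0 with an empty prefix, values are
integers (so the maximal bound is one below the first value
that no longer fits), and the slice size is counted as the
number of val_0, idx_0 and val_1 entries, converted to blocks of
block_words words. The names probe, slice_blocks and SPILL
mirror Fig. 3 but are otherwise illustrative.

```python
from bisect import bisect_left

SPILL = object()  # sentinel: even the smallest slice exceeds the budget


def slice_blocks(idx0, i, j, block_words):
    """Blocks needed for the slice covering positions i..j-1 of val0."""
    # entries copied from val0, from idx0 (with closing entry), from val1
    words = (j - i) + (j - i + 1) + (idx0[j] - idx0[i])
    return -(-words // block_words)  # ceiling division


def probe(val0, idx0, low, m, block_words):
    """Largest h >= low whose level-0 slice [low, h] fits into m blocks."""
    i = bisect_left(val0, low)
    if i == len(val0):
        return float("inf")  # nothing stored at or above low
    if slice_blocks(idx0, i, i + 1, block_words) > m:
        return SPILL
    lo, hi = i + 1, len(val0)  # candidate slice ends; lo always fits
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if slice_blocks(idx0, i, mid, block_words) <= m:
            lo = mid
        else:
            hi = mid - 1
    # everything fits: unbounded; otherwise stop just below val0[lo]
    return float("inf") if lo == len(val0) else val0[lo] - 1


# Hypothetical data: six first-column values and their child offsets.
val0 = [1, 2, 3, 4, 5, 6]
idx0 = [0, 2, 5, 7, 8, 10, 11]
print(probe(val0, idx0, 1, 3, 4))
```

Each size check reads only two index entries, so the cost is
dominated by the binary search over val_0, in line with the
logarithmic bound of Proposition 8.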

3.3 Boxing Procedure 

To help exposition, we first describe aspects of the boxing
approach via examples, before we cover the general case.

Joins with one variable. Consider a join over multiple
unary relations such as

Q(x) ← R(x), S(x), T(x).

Imagine each of the body relations is larger than the
available internal memory M. We can divide the internal
memory into four parts, one for the output data and one for
each of the input relations. Since the output is written
append-only, a relatively small portion of memory, which is
written to disk once it fills up, is sufficient. We thus divide
up the bulk of the memory among the three input relations. We
can use the simple strategy of dividing the space evenly. A
boxed LFTJ execution would then simply alternate probing,
provisioning, and calling LFTJ as described in Fig. 4.

Not surprisingly, this approach works well for this
limited class of joins: for reading the input, it requires a
number of I/Os bounded by $O(|I|/B + (|I|/M) \log |I|)$, with |I|
being the combined size of the input relations. The key
observation for showing the bound is that in each iteration
(except possibly the last), at least one relation will load
$O(M)$ (in our example around M/3) tuples using $O(M/B)$
block reads. Now, since there are only |I| tuples in the input,
there are at most $O(|I|/M)$ iterations. Since each probing


variables: m-Tuple low, high ▷ box boundaries

 1: procedure Main
 2:   BoxUp(1)
 3: procedure BoxUp(int i) ▷ i corresponds to x_i
 4:   low[i] ← −∞
 5:   repeat
 6:     probe input R_i from low[i] for upper bound h_i
 7:     high[i] ← h_i
 8:     provision R_i from low[i] to high[i]
 9:     if i < m then: BoxUp(i + 1)
10:     else: run LFTJ on slices ▷ Box: [low···high]
11:     low[i] ← succ(high[i])
12:   until ∞ = high[i]


Figure 5: Example: Boxing for R_1(x_1), ..., R_m(x_m)

can be done in $O(\log |I|)$, we obtain the desired bound.^
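
The probe, provision, and join loop for one-variable joins can be sketched in a few lines. This is an illustrative in-memory model only: relations are sorted Python lists of distinct integers, the budget is counted in tuples per relation (assumed to be at least 1), and a plain set intersection stands in for the LFTJ call. The function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

INF = float("inf")

def probe_unary(rel, l, budget):
    """Maximal h >= l such that at most `budget` values of rel lie in [l, h]."""
    start = bisect_left(rel, l)
    if start + budget >= len(rel):
        return INF                       # the rest of the relation fits
    return rel[start + budget] - 1       # stop just before the first value that does not fit

def provision(rel, l, h):
    """Copy the slice of rel with values in [l, h]."""
    return rel[bisect_left(rel, l):bisect_right(rel, h)]

def boxed_intersection(relations, budget):
    """Join unary relations box by box; returns (result, number of boxes)."""
    out, l, boxes = [], -INF, 0
    while True:
        h = min(probe_unary(r, l, budget) for r in relations)
        slices = [provision(r, l, h) for r in relations]
        # Stand-in for running LFTJ on the provisioned slices.
        out.extend(sorted(set.intersection(*map(set, slices))))
        boxes += 1
        if h == INF:
            return out, boxes
        l = h + 1                        # succ(h) for integer values
```

Taking the minimum of the probed bounds guarantees that every provisioned slice respects its budget, and the relation that determined the minimum loads a full budget of tuples, which is the observation the I/O bound relies on.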
Unary cross-products. Consider the cross-product of m
unary relations, with each relation larger than M:

$Q(x_1, \ldots, x_m) \leftarrow R_1(x_1), \ldots, R_m(x_m).$

We again split the bulk of the available memory across the
m input relations. The boxing procedure is recursive, where
each dimension i of the recursion corresponds to a variable
$x_i$ (see Fig. 5). The procedure starts with i = 1. In general,
at a dimension i, we loop over the predicate $R_i$ via the
probe-provisioning loop. Then, for each slice at dimension i, we
do the same recursively for the next higher dimension. At
the bottom of the recursion, when we have reached i = m,
we call LFTJ on the created slices. At that point, the slices
provide data for the box [low···high], i.e., the box in which
each variable $x_i$ can range from low[i] to high[i]. Note that
(like above) we can run the original query over the slice data
since the slices are guaranteed to not have data outside their
range, and thus the boxes partition the search space without overlap.
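
The recursion for cross-products can be illustrated as follows. As before, this is a sketch under simplifying assumptions: sorted lists of distinct integers stand in for the unary relations, the budget is counted in tuples per relation, `itertools.product` stands in for running LFTJ on a box, and all function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from itertools import product

INF = float("inf")

def probe(rel, l, budget):
    """Maximal h >= l such that at most `budget` values of rel lie in [l, h]."""
    start = bisect_left(rel, l)
    return INF if start + budget >= len(rel) else rel[start + budget] - 1

def provision(rel, l, h):
    """Copy the slice of rel with values in [l, h]."""
    return rel[bisect_left(rel, l):bisect_right(rel, h)]

def box_up(relations, budget, i=0, slices=(), out=None):
    """Recursively box dimension i; emit the cross-product of each complete box."""
    out = [] if out is None else out
    low = -INF
    while True:
        high = probe(relations[i], low, budget)
        current = slices + (provision(relations[i], low, high),)
        if i + 1 < len(relations):
            box_up(relations, budget, i + 1, current, out)
        else:
            out.extend(product(*current))    # stand-in for running LFTJ on the box
        if high == INF:
            return out
        low = high + 1                       # succ(high) for integer values
```

Because the slices of each relation are disjoint and cover the relation, every output tuple is produced in exactly one box.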
General joins. The general approach combines the two
previous algorithms while also considering corner cases. Let
Q be a general full-conjunctive join of m atoms, with variable
order $\pi = x_1, \ldots, x_n$, no atom containing the same
variable twice, and all atoms in Q mentioning variables
consistent with $\pi$. We first group the atoms based on their first
variable $x_j$: we place all atoms that have $x_j$ as first variable
into the array atoms[1..n] at position j. To follow the
exposition, consider the join

$Q(x_1, x_2, x_3) \leftarrow R(x_1, x_2), S(x_1, x_3), T(x_2, x_3), U(x_1)$

where we put R, S, and U into atoms[1] and T into atoms[2].
Like for cross-products, we recursively provision for the
dimension i ranging from 1 to n. For each i, we use the method
for joining unary relations for the atoms in atoms[i]. In
particular, for each $A_j \in$ atoms[i] we probe and create slices
for $A_j$ at level 0 regardless of i or the arity of $A_j$. Thus, at
dimension i, we iteratively provision atoms with $x_i$ as their
first attribute, restricting the range of $x_i$ but not any of the
other variables $x_k$, k > i. This ensures that we can freely
choose any partitions we might perform on these variables

^Note that with a simple caching strategy for the provision 
step (always cache the block containing h and reuse in the 
next provisioning if possible), we could make the argument 
that each block is read at most once by the provisioning step 
obtaining the same asymptotic bound. 











$x_k$ for k > i. Like with cross-products, we call LFTJ at the
lowest level when i = n.

The above works well unless any of the probes reports a
SPILL, which can occur if a relation exhibits significant skew.
For example, imagine there is a value a for which $|S_a(x_3)|$
exceeds the allocated storage. Then, at dimension i = 1,
probing S at level 0 with a lower bound a will return SPILL.
We handle these situations by setting the upper bound at
level i = 1 to a, and essentially marking $S_a$ as a relation
that needs to be provisioned at the dimension of its second
attribute (e.g., 3) alongside the atoms in atoms[3]. Note that
a relation of arity k can spill k − 1 times in the worst case.

The general algorithm is given in Algorithm 2. We evenly
divide the available storage among the n dimensions, and
assign the atoms A to atoms[i] accordingly (lines 3-4). We
also use a variable leftoverMem to let lower dimensions utilize
memory that was not fully used by higher dimensions.
In line 11, we union the spills from the previous level to the
atoms we need to provision. The method probe in line 12
probes the atoms in atms to find an upper bound such that all
atoms can be provisioned. Here, we evenly divide mem by the
size of atms. The lower bounds for probing are taken from
low, which is also used to determine the starting tuples for
possible spills. The method sets the upper bound at the current
dimension and fills the spills predicate if necessary. The
method provision provisions the predicate A with bounds
from low and high adapted to the variables occurring in A.
It returns the slice and the size of used memory.

3.4 I/O Complexity of Boxing 

We now analyze the Boxing approach to obtain complexity 
bounds on the number of block I/Os. Since we concentrate 
on full conjunctive queries, every output tuple is computed 
exactly once by LFTJ. As explained above, we use some 
constant-size buffer to let the I/O cost for the output be 
K/B where K is the output size. We now analyze the cost 
of the I/Os for reading input data. 

For each dimension i, i = 1, ..., n, let $L_i$ be an upper bound
on how often the repeat-until loop from lines 9-23 of
Algorithm 2 is executed for a single invocation of the surrounding
BoxUp procedure. $L_i$ is determined by how often we
need to provision to completely iterate through the atoms
in atoms[i] ∪ spill[i − 1]. In each step (except possibly the last),
at least one of the input predicates $A_j$ loads $O(M)$ tuples;
this is the predicate that determines the high bound high[i].
In case there is no spill, this is immediately clear; but it
is even true if a predicate is being spilled because its tuples
are then “consumed” at a higher dimension. Note that
at the last dimension, no spills can occur. We thus have
$L_i \in O(m|I|/M) = O(|I|/M)$, where m is the number of
atoms in the join.

Let us now determine how often BoxUp is called for each
dimension. We denote this number by $B_i$. The outermost
BoxUp is called once; BoxUp(2, ·) is called once for each
iteration of the repeat loop at level 1, that is, $L_1$ times. In
general, $B_i = B_{i-1} L_{i-1}$ and consequently $B_i \in O(|I|^{i-1}/M^{i-1})$.

It is convenient to inject the following observation: 

Lemma 9 The number of boxes created by a boxed LFTJ
with n variables on input I is $O(|I|^n/M^n)$. In particular, if
$|I| \in O(M)$ then the number of boxes is $O(1)$.

Proof The number of boxes equals the number of loop
executions at dimension n, which is bounded by $L_n B_n$. □
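
The counting argument behind Lemma 9 can be written out step by step. The following only restates the bounds from the text, under the reading that $B_1 = 1$ and that each loop iteration at dimension $i-1$ triggers one BoxUp call at dimension $i$:

```latex
\begin{align*}
L_i &\in O\!\left(\frac{|I|}{M}\right)
  && \text{(loop iterations per BoxUp call at dimension } i\text{)} \\
B_1 &= 1, \qquad
B_i = B_{i-1}\, L_{i-1} \in O\!\left(\frac{|I|^{\,i-1}}{M^{\,i-1}}\right)
  && \text{(BoxUp calls at dimension } i\text{)} \\
\#\text{boxes} &\le L_n\, B_n \in O\!\left(\frac{|I|^{\,n}}{M^{\,n}}\right)
  && \text{(loop iterations at dimension } n\text{)}
\end{align*}
```

For instance, if $|I| \le cM$ for a constant c, every factor $|I|/M$ is at most c, so the number of boxes is $O(c^n) = O(1)$ for a fixed query.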


Algorithm 2 Boxing Leapfrog Triejoin

in: mem_max                                ▷ available memory in blocks
    A_1, ..., A_m                          ▷ body atoms and TrieArrays
    x_1, ..., x_n                          ▷ key order, n variables
variables:
    n-Tuple low, high                      ▷ box boundaries
    Array of AtomSet atoms[1..n]           ▷ atoms per level
    Array of SliceSet S[1..n]              ▷ provisioned slices
    Array of AtomSet spill[0..n]           ▷ spilled-over atoms
    Array of int budget[1..n]              ▷ of memory in blocks

 1: procedure Main
 2:   for i ∈ {1, ..., n} do
 3:     budget[i] ← mem_max / n
 4:     atoms[i] ← {A_j | x_i is first variable in A_j}
 5:   BoxUp(1, 0)                          ▷ 1st variable, no leftover memory
 6: procedure BoxUp(i, leftoverMem)
 7:   mem ← budget[i] + leftoverMem
 8:   low ← [−∞, ..., −∞]; high ← [∞, ..., ∞]
 9:   repeat
10:     S[i] ← ∅; usedMem ← 0
11:     atms ← atoms[i] ∪ spill[i − 1]
12:     spill[i], high[i] ← probe(i, atms, mem, low)
13:     for A ∈ atms \ spill[i] do
14:       slice, m ← provision(A, low, high)
15:       usedMem ← usedMem + m
16:       S[i] ← S[i] ∪ slice
17:     if i < n then
18:       leftoverMem ← mem − usedMem
19:       BoxUp(i + 1, leftoverMem)
20:     else
21:       run LFTJ on ⋃_{k=1..n} S[k] on Box[low···high]
22:     low[i] ← succ(high[i])
23:   until ∞ = high[i]


Back to the I/O costs. Consider only the I/O that is
performed directly in a certain BoxUp call without counting
the cost in the recursive calls from line 19. First, we count
provisioning only. Here, during the evaluation of the repeat
loop (lines 9-23), we load the data in atoms[·] ∪ spill[·] ⊆ I.
Similarly to the case of joins with one variable, we can
cache the blocks containing the last tuple of the provisioned
TrieSlices, and thus load each block from the input
exactly once. Consequently, the I/O work done to provision
directly in each invocation of BoxUp(i) is limited by
$O(|I|/B)$. The I/Os necessary for probing can be bounded by
$O(L_i \log |I|) = O((|I|/M) \log |I|)$ since we probe at most m
relations once for each execution of the repeat loop. If we use
the assumption that |I|/B is larger than $\log |I|$, as explained
in Section ??, we thus obtain $O(|I|/B)$ as I/O cost directly
at dimension i for a single BoxUp call. As the last step, we
multiply by $B_i$ to obtain the total I/O cost $C_i$ at dimension
i as $C_i \in O(|I|^i/(M^{i-1}B))$. Since output is written once
and we consider joins without projections, we obtain:

Theorem 10 The I/O complexity of boxed LFTJ with n
variables, input I and output of size K is
$O(|I|^n/(M^{n-1}B) + K/B)$.
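
To see why the sum over all dimensions collapses to the term for dimension n, one can compare the per-dimension bounds of consecutive dimensions. This is a sketch under the additional assumption that $|I| \ge M$ (otherwise the input fits in memory and boxing is unnecessary):

```latex
\begin{align*}
C_i &\in O\!\left(\frac{|I|^{\,i}}{M^{\,i-1} B}\right),
\qquad
\frac{|I|^{\,i}/(M^{\,i-1}B)}{|I|^{\,i-1}/(M^{\,i-2}B)} = \frac{|I|}{M} \ge 1, \\
\sum_{i=1}^{n} C_i &\in O\!\left(n \cdot \frac{|I|^{\,n}}{M^{\,n-1} B}\right)
  = O\!\left(\frac{|I|^{\,n}}{M^{\,n-1} B}\right)
  \quad \text{for a fixed number } n \text{ of variables.}
\end{align*}
```

Adding the K/B block writes for the output yields the stated total.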

Optimality. This complexity is optimal when only the
number n of variables is used to characterize the query. This
is because the Cartesian product of n relations can produce
$\Theta(|I|^n)$ output, which requires $\Theta(|I|^n/B)$ block writes.

Furthermore, in practice, the input is often only a constant
factor larger than the available memory:

Corollary 11 The I/O complexity of boxed LFTJ for any
query on input I and output of size K is $O(|I|/B + K/B)$
if $|I| \in O(M)$.

This (better looking) bound is, obviously, optimal for queries 
that require reading the entire input. 

No spills. If the execution does not produce any spills, we
can strengthen the general result. To do so, we first need
to introduce a property of queries:

Definition 12 The rank $r_\pi(Q)$ of a query Q conforming
to the key-order $\pi = x_1, \ldots, x_n$ is the largest j for which Q
contains an atom with $x_j$ as first variable. The rank r(Q)
of Q is the minimum of $r_\pi(Q)$, where $\pi$ ranges over all key-orders.

Clearly, the rank of a query (for any key-order) is bounded by
the number of variables, but it is sometimes smaller. E.g., for
the triangle query $\Delta$, $r_{x,y,z}(\Delta) = 2$, and $r(\Delta)$ is 2 as well. Note
that $r_\pi(Q)$ is the largest i for which atoms[i] is non-empty
when boxing Q with key-order $\pi$.
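
As a small illustration of Definition 12, the rank for a given key-order can be computed directly from the first variables of the atoms. The representation of atoms as tuples of variable names and the function name `rank_for_order` are assumptions made for this sketch only.

```python
def rank_for_order(atoms, order):
    """Largest (1-based) position j such that some atom has order[j-1] as first variable.

    atoms: list of tuples of variable names, each listed consistently with `order`.
    order: tuple of variable names, the key-order.
    """
    pos = {v: i + 1 for i, v in enumerate(order)}
    first_positions = []
    for atom in atoms:
        positions = [pos[v] for v in atom]
        if positions != sorted(positions):
            raise ValueError(f"atom {atom} does not conform to key-order {order}")
        first_positions.append(positions[0])
    return max(first_positions)

# Triangle query: E(x, y), E(x, z), E(y, z) with key-order x, y, z.
triangle = [("x", "y"), ("x", "z"), ("y", "z")]
print(rank_for_order(triangle, ("x", "y", "z")))   # 2
```

For a cross-product of three unary relations the same function returns 3, since every variable is the first variable of some atom.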

Theorem 13 If no spills occur during a boxed execution
of LFTJ for the query Q with key-order $\pi$, then the total
I/O cost is $O(|I|^\ell/(M^{\ell-1}B) + K/B)$, where |I| denotes the
combined size of the input relations, K the combined size of
the output relations and $\ell = r_\pi(Q)$ is the rank of Q for $\pi$.

Proof At dimensions $i > \ell$, there are no I/O operations
since both atoms[i] and spill[i] are empty; we obtain the
desired result by summing up $C_i$ for $i \le \ell$. □

Spills occur in the boxed LFTJ execution if there is an input
relation R and a value a for which the slice $R_{a \to a}$ exceeds the
size of the memory allocated for R. We can thus characterize
when they occur. Consider a query with n variables and m atoms,
and let M' be the memory used for the body of the query. If we
divide up all space evenly among all n variables and, for each
dimension, evenly among all m predicates, then the critical
value for any $|R_a|$ is approximately $M'/(2nm(k-1))$, where k
is the arity of R, since the slice for $R_{a \to a}$ has a size of at
most around $2(k-1)|R_a|$.

3.5 CPU Data-Complexity of Boxing 

The CPU work performed by a boxed LFTJ on input data
I falls into two categories: (1) the work necessary to determine
the number of boxes and to provision them, and (2)
the work done by the in-memory LFTJ executing over the
boxes. For an input I, the asymptotic work in category 2
is trivially bounded by the asymptotic work of the in-memory
LFTJ on I multiplied by the number P of boxes used, simply
because each invocation uses input that is a subset of
I. For the work in category 1: deciding on the bounds of
a single box is done in $O(\log |I|)$, and copying its required data
takes no more than $O(|I|)$, resulting in a total upper bound
of $O(P|I|)$ for P boxes.

Using Lemma 9, we can thus conclude:

Theorem 14 On inputs I that are only a constant factor
larger than the available memory M, the asymptotic computational
data-complexity of the boxed LFTJ matches that
of the in-memory version of LFTJ or is linear in |I|,
whichever is worse.


4. LFTJ APPLIED TO TRIANGLE LISTING 


4.1 Boxed LFTJ-Δ

From Corollary 11 we immediately get an I/O complexity
of $O(|E|/B + K/B)$ if $|E| \in O(M)$. Without this assumption,
plugging the triangle query into Theorems 10 and
13, we obtain:


Corollary 15 The boxed LFTJ applied to the triangle query
has an I/O complexity of $O(|E|^3/(M^2B) + K/B)$. If there
are no spills, the complexity is $O(|E|^2/(MB) + K/B)$.


With no spills, boxed LFTJ thus matches the I/O complexity
of MGT [10], which is optimal if M > |V| as shown in
[10]. From above, we know that spills only occur if there is a
single node that has more than around M/18 neighbors. For
5GiB of allocated memory and 64-bit node ids, this amounts
to an upper limit of 37 million neighbors per node, a number
that is seldom reached in practice. Interestingly, the core
MGT algorithm in [10] also requires that the node degree is
limited. MGT achieves the bound without restrictions by
deploying a pre-processing step.
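
The figure of roughly 37 million neighbors can be reproduced with a few lines of arithmetic. The helper below is a hypothetical illustration of the threshold M'/(2nm(k − 1)) from Section 3.4, assuming that memory is counted in 8-byte values and that the triangle query has n = 3 variables, m = 3 atoms, and binary relations (k = 2), which gives the factor 18.

```python
def spill_threshold(mem_bytes, n_vars, n_atoms, arity, bytes_per_value=8):
    """Approximate number of children a single value may have before a spill occurs."""
    values = mem_bytes // bytes_per_value              # memory measured in stored values
    return values // (2 * n_vars * n_atoms * (arity - 1))

# 5 GiB of memory, triangle query over a binary edge relation with 64-bit node ids.
limit = spill_threshold(5 * 2**30, n_vars=3, n_atoms=3, arity=2)
print(limit)   # 37282702
```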

For the compute complexity of boxed LFTJ-Δ, we rely
on Theorem 14, expecting to essentially maintain the
performance of in-memory LFTJ-Δ assuming $|I| \in O(M)$.


4.2 In-Memory LFTJ-Δ CPU Complexity

In this section, we use the convention that G = (V, E)
is always the input graph. While the previous section was
specific to our version of LFTJ that uses TrieArrays, the
results here apply to all LFTJ implementations as long as
the basic TrieIterator operations adhere to the complexity
bounds given in [36] and restated in Section 2.1. Following
with little work directly from [?] and [?]:


Proposition 16 LFTJ-Δ’s computational complexity on input
E is $O(|E|^{3/2} \log |E|)$, which is optimal modulo the
log-factor.

Proof. See Apx. B.2. □


The rest of the section strengthens this result by analyzing
the complexity of LFTJ on families of graphs that are
characterized by the number of edges and their arboricity.
The arboricity $\alpha(G)$ of an undirected graph G is a standard
measure for graphs, counting the minimum number of edge-disjoint
forests that are needed to cover the graph. A classic
result by Nash-Williams [?] links this number to the graph’s
density by showing that no subgraph H of G has more than
$k(|V(H)| - 1)$ edges if and only if $\alpha(G) \le k$. In general, $\alpha$
is in $O(\sqrt{|E|})$ [?] for any graph G = (V, E). However, in
many real-world graphs, $\alpha$ is significantly smaller [?, 15].


It turns out that the runtime complexity of LFTJ-Δ is related
to the graph’s arboricity, with LFTJ-Δ behaving better
the smaller $\alpha$ is. It thus makes sense to consider LFTJ-Δ’s
complexity for graphs characterized by an upper bound on
their arboricity. For compatibility with the asymptotic complexity,
we bound a graph’s arboricity with respect to its
number of edges:


Theorem 17 Let $\hat{\alpha} : \mathbb{N} \to \mathbb{N}$ be a monotonically increasing
function. Then, LFTJ-Δ runs in $O(m\,\hat{\alpha}(m) \log m)$ time on
graphs with at most m edges and arboricity of at most $\hat{\alpha}(m)$.

Proof. See Apx. B.3. Analyze the work done by the
leapfrog joins at levels x, y, and z. Only the third level is
interesting, where we use a result by [?] that gives an upper
bound of $2\alpha(G)|E|$ for the sum $\sum_{(x,y) \in E} \min\{d(x), d(y)\}$. □

Clearly, if the maximum degree of our graphs is bounded,
then their arboricity is in $O(1)$. Furthermore, the arboricity
of planar graphs is also in $O(1)$ [?], immediately leading to:

Corollary 18 LFTJ-Δ lists triangles in $O(|E| \log |E|)$ steps
for planar graphs and for graphs with bounded degree.
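
The role of the sum of minimum degrees can be illustrated with a small in-memory triangle lister. This is not the TrieArray-based implementation: it is a sketch that orients each edge from the smaller to the larger node id, keeps sorted adjacency lists, and intersects them with a leapfrog-style alternating seek. All function names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict

def leapfrog_intersect(a, b):
    """Intersect two sorted lists by alternately seeking (binary search) in the other list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i = bisect_left(a, b[j], i + 1)    # seek in a to the first value >= b[j]
        else:
            j = bisect_left(b, a[i], j + 1)    # seek in b to the first value >= a[i]
    return out

def simple_edges(edges):
    """Orient edges as (smaller, larger) and drop self-loops and duplicates."""
    return sorted({(min(e), max(e)) for e in edges if e[0] != e[1]})

def list_triangles(edges):
    """List all triangles (x, y, z) with x < y < z."""
    adj = defaultdict(list)
    for a, b in simple_edges(edges):
        adj[a].append(b)                       # lists stay sorted: edges arrive in sorted order
    return [(x, y, z)
            for x in sorted(adj)
            for y in adj[x]
            for z in leapfrog_intersect(adj[x], adj.get(y, []))]

def min_degree_sum(edges):
    """Sum of min(d(x), d(y)) over all undirected edges {x, y}."""
    es = simple_edges(edges)
    deg = defaultdict(int)
    for a, b in es:
        deg[a] += 1
        deg[b] += 1
    return sum(min(deg[a], deg[b]) for a, b in es)
```

On the complete graph with four nodes, `list_triangles` returns the four triangles and `min_degree_sum` returns 18, which is consistent with the bound $2\alpha(G)|E| = 24$ since that graph has arboricity 2.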

We can also amend the optimality result from Prop. 16,
showing that LFTJ-Δ remains optimal (modulo log-factor)
even when considering graphs with a limited arboricity:

Theorem 19 Let $\hat{\alpha} : \mathbb{N} \to \mathbb{N}$ be a monotonically increasing,
computable function that is not identical to 1 and in
$o(\sqrt{n})$. Then, no algorithm that lists all triangles for input
graphs G = (V, E) with arboricity of at most $\hat{\alpha}(|E|)$ can run
in $o(|E|\,\hat{\alpha}(|E|))$ time.

Proof. See Apx. B.4. It turns out that for any such
$\hat{\alpha}$, we can construct large graphs that have $\Omega(|E|\,\hat{\alpha}(|E|))$
triangles. □

We highlight that the above theorem is quite general. It
only requires the algorithm to be correct for input graphs of
restricted arboricity.® For example, even if we (somehow)
knew that all our input graphs have an arboricity $\alpha$ bounded
by, say, $42 \log |E|$, we could not design a specialized algorithm
that only works on these graphs and has a runtime
complexity of $o(|E| \log |E|)$.

The optimality from Theorem 19 unfortunately does not
follow directly from the worst-case optimality of LFTJ for
families of instances that are closed under renumbering (Thm 4.2
in [36]), because the optimality in [36] was obtained when
each relation symbol appears only once in the body of the
join, a property used in the proof for Thm 4.2 of [36].

5. IMPLEMENTATION 

We have implemented a general-purpose join-processing
system with LFTJ at its core. To highlight its generality,
we briefly list its current features. We support multiple
fixed-size primitive data types including int64, double,
boolean, and a fixed-point decimal type. Predicates
(stored as TrieArrays) can have variable arities and we support
marking a prefix of the attributes as key (the TrieArray
then needs fewer index arrays). Predicates support loading
and storing from and into CSV files. Besides materialized
predicates that store data, we have TrieIterator implementations
for various “builtins” such as comparison operators and
arithmetic operators. Using a simple command-shell, joins
such as the triangle query can be issued in a Datalog-like
syntax. We require the written joins to have atoms with
variables consistent with a global key order. At the head
of rules, we support optional projections and some aggregations.
The system uses secondary storage (via memory-mapped
files) to allow processing of data that exceeds the
physical memory, and deploys the Boxing technique presented
here. We have not implemented a query optimizer (to
find good key orders), nor do we currently support mutating

® Except for the corner case where the arboricity is bounded by
1, in which case the graphs have no triangles and an $O(1)$
algorithm trivially exists.


relations; we also do not support transactions. In the
following, we highlight aspects of the system that likely have
an impact on performance, yet whose detailed analysis and
description go beyond the scope of this paper.

Removing interpretation overhead. Datalog queries
that are issued are compiled to optimized machine code and
loaded as a shared library into the system. Our code still
uses the TrieIterator interfaces, but most code is templatized:
predicates by their arity, key-length and types of the
attributes; TrieIterators by their types and arity; the LFTJ by
the key-order, TrieIterators of body atoms as well as each
of their variables; a rule by the LFTJ for processing the
body and the classes that perform so-called head-actions.
Using this approach, we can still program with the convenient
TrieIterator interfaces, yet allow the C++ compiler
to potentially inline join processing all the way down to the
binary searches using the appropriate comparison operators
for the type at hand.

Misc Optimizations. We are also deploying a parallelization
scheme for LFTJ to utilize multiple cores. In the boxed
LFTJ version, boxes are worked on one after another, yet
LFTJ utilizes available cores while processing a single box.
We will also provide single-threaded performance when comparing
with single-threaded competitors.

Even though dividing the available memory evenly across
the dimensions is sufficient to obtain the asymptotic complexity
bounds, using more memory at smaller dimensions
reduces the number of boxes created. Note that as long as
the memory used at each dimension is a constant fraction
of the total memory, the complexity bounds remain intact.
We picked a ratio of 4:1 for dividing up the memory between
x:y in the triangle query. We also do not allocate budget to
dimensions j that do not have an atom using $x_j$ as first
variable (e.g., z). This is fine since, in case there is a spill, the
budget for the spilling relation will be moved over to the
next dimension.
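
A weighted split of the memory budget, such as the 4:1 ratio between x and y, could look as follows. The function name and the use of integer block counts are assumptions for this sketch; dimensions without atoms simply receive weight 0.

```python
def split_budget(total_blocks, weights):
    """Distribute `total_blocks` across dimensions proportionally to their weights."""
    total_weight = sum(weights.values())
    return {dim: total_blocks * w // total_weight for dim, w in weights.items()}

# Triangle query: x and y share the memory 4:1, z has no atom starting with it.
print(split_budget(1000, {"x": 4, "y": 1, "z": 0}))   # {'x': 800, 'y': 200, 'z': 0}
```

Since every dimension with a non-zero weight still receives a constant fraction of the total, this kind of split leaves the asymptotic bounds unchanged.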

If there are two atoms referring to the same relation and
having the same first variable, we naturally only provision
and create one slice for them. For example, in the triangle
query, we probe and provision a single relation E at dimension
x for the atom E(x, y) and the atom E(x, z). Of course,
in the case of spills they might get untangled at higher dimensions.
We do not exploit the fact that the third atom
E(y, z) refers to the same relation.

We envision that for some queries, an optimizer, aided by
constraints provided by the user, can avoid provisioning certain
boxes because it can infer that there cannot possibly be
a query result within that box. For example, in our case,
we know that x < y < z. This can easily be inferred from
the constraint a < b for any (a, b) ∈ E. Based on this, we
do not need to provision at dimension y if the high bound
for y is smaller than the low bound for x. We have put a
hook into the boxing mechanism to bypass provisioning if
this condition is met after probing. A detailed exploration
of constraints and their interactions with probing and provisioning
is beyond the scope of this work.
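
The bypass condition can be expressed as a small predicate on the box boundaries. The sketch below uses a hypothetical representation of a box as two dictionaries of bounds and counts how many boxes of a regular 3x3 grid over (x, y) would still need to be provisioned.

```python
def box_may_contain_result(low, high):
    """False if the constraint x < y rules out every point of the box."""
    return not (high["y"] < low["x"])

x_ranges = [(0, 9), (10, 19), (20, 29)]
y_ranges = [(0, 9), (10, 19), (20, 29)]
kept = [(xr, yr)
        for xr in x_ranges
        for yr in y_ranges
        if box_may_contain_result({"x": xr[0], "y": yr[0]},
                                  {"x": xr[1], "y": yr[1]})]
print(len(kept))   # 6 of the 9 boxes remain
```

The predicate follows the condition stated in the text. Because x < y is strict, a box whose high bound for y equals the low bound for x cannot contain a result either, so the test could be tightened accordingly.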

6. EXPERIMENTAL EVALUATION 

In our experimental evaluation, we focus on the triangle 
listing problem. Here, we investigate the following questions: 

(1) What is the CPU overhead introduced by boxing LFTJ?
(2) How well does boxed LFTJ cope with limited available
main memory, and how does vanilla LFTJ do? (3) How does
LFTJ compare to best-in-class competitors?


Figure 6: Characteristics of the used data sets.

Evaluation environment. We use a desktop machine with
an Intel i7-4771 CPU that has 4 cores (8 hyper-threads),
each clocked at 3.5GHz. The machine has 32GB of physical
memory and a single SSD. It is running Ubuntu 14.04.1
with a stock 3.13 Linux kernel.


Data. We use both real-world and synthetic input data
of varying sizes. The data statistics are shown in Fig. 6.
The smallest dataset we consider is “LJ”, which contains the
friendship graph of the online blogging community LiveJournal
[?, 38]. Next, “Orkut” is the friendship graph of
the free online community Orkut [?, ?]. ‘TWITTER’ is
one of the largest freely available graph data sets. It contains
the as-of-2010 “follower” relationships among 42 million
Twitter users [?]. The dataset has 1.47 billion of these
relations, which we interpret as undirected edges in a graph,
resulting in 1.2 billion edges. This dataset contains almost
35 billion triangles. Unlike the first two data sets, which we
obtained from [?], TWITTER was gathered from [?]. We also
consider synthetically generated data due to its better understood
characteristics. We focus on two datasets: ‘RAND’
and ‘RMAT’. Each comes in a medium-sized version with
16 million nodes and 256 million edges and a large version
with 80 million nodes and 1.28 billion edges. In the ‘RAND’
dataset, we create edges by uniformly randomly selecting
two endpoints from the graph’s nodes. The ‘RMAT’ data
contains graphs created by the Recursive Matrix approach
as proposed by Chakrabarti et al. [?]. This approach creates
graphs that closely match real-world graphs such as computer
networks or web graphs. We used the data generator
available at [?] with its default parameters. The LiveJournal
and the synthetic graphs were also used by the MGT
work in [10] and earlier work to evaluate out-of-core performance
for the triangle listing problem. All graphs have
been made simple by removing self and duplicate edges. The
CSV sizes in Fig. 6 refer to the CSV data where each undirected
edge {a, b} is mentioned only once. TA stands for our
TrieArray representation as described in the earlier sections.
We use 64-bit integers per node identifier.


Methodology. We measure and present the time for running
the algorithms on the mentioned data sets with various
configurations and memory restrictions. We run
our TrieArray-based implementation of LFTJ with various
configurations and two competing algorithms. Since all algorithms
need to report the same number of triangles, we
essentially run them in “counting-mode” and thus do not
account for the time or the I/O it takes to output the triangles.
This was also done in [?]. Input data for LFTJ
is given in TrieArray format; we do not include the time it
takes to create the TrieArray from CSV data (which can be
done in at most two passes after sorting the data).

What CPU overhead does Boxing introduce? To
measure the CPU overhead that is introduced by the boxing
approach, we instruct LFTJ to only use memory the size of a
fraction of the input during execution; yet, we do not place
any limit on the caches the operating system keeps for file
operations. To further (almost completely) remove I/O, we
precede the execution by cat-ting all input data to /dev/null,
which essentially pre-loads the Linux file-system cache. We
now consider two questions: (i) What is the CPU overhead
for probing and copying? and (ii) What is the overhead
introduced by running LFTJ on individual boxes in comparison
to running LFTJ on the whole input data? To answer
the first question, we simply run three variants: (a) the full
LFTJ, (b) probing and copying data into TrieArraySlices
without running LFTJ, and (c) only probing without copying
input data or running LFTJ. Results are shown in the
first row of Fig. 7. On the x-axis, we vary the space available
for boxing. The individual points range from 5%, 10%, ...
up to 200% of the input data size in TrieArray representation.
We chose to range up to 200% since the input is
essentially read twice by LFTJ-Δ: once for each of the dimensions
x and y.

Results. Answering question (i): We can see that the CPU
work performed for probing and copying is very low in comparison
to the work done by the join evaluation, even when
the box sizes are limited to as little as 5% of the size of the
input. Answering (ii), we look at the red lines for LFTJ and
compare the curve with the value at the far right, as this one
is achieved by using a single box. The real-world data sets
behave as expected: starting at around 25%, they level out,
demonstrating that the CPU overhead is low if the available
memory is not too much smaller than the input data size.
Now, for the synthetic data sets, we see that, unexpectedly,
using more boxes reduces the CPU work (memory range
10%-200%). We speculate that this is because the boxed
version might reduce the work done in binary searches for
seek since the space that needs to be searched is smaller.
Only at 5% does this trend reverse, and using more, smaller
boxes takes longer.

How well does Boxing do with limited memory? We
are also interested in the performance of the boxing technique
when disk I/O needs to be performed. Here, we run
the same experiments as above but we clear all Linux system
caches (see Apx. C.1) before we start a run. We further
use Linux’s cgroup feature to limit the total amount of RAM
used for the program (data+executable) and any caches used
by the operating system to buffer I/O on behalf of the program.
As the actual limit we use the value given to the boxing
and shown on the x-axis plus a fixed 100MB (that accounts
for the output buffer and the size of the executable). Results
are shown in the second row of Fig. 7. We see that probing
is still very cheap even for the 5% memory setting; provisioning
the data now has noticeable costs for low-memory
settings (25% and below). However, even then, it is mostly
dominated by the time to actually perform the in-memory
joins. This is even more so for the real-world data sets.
Overall, with around 25% or more memory, boxed LFTJ’s
performance stays constant, indicating that I/O is not the
bottleneck. For example, we can count all 37 billion triangles
in the TWITTER dataset in around 29 minutes without
I/O and only need up to 35 minutes with disk I/O.

In the third row of Fig. 7, we show the number of boxes used
as well as the total amount of memory copied for provisioning
as a multiple of the TrieArray input size from Fig. 6. We
see that the number of boxes is generally below 100 unless
the memory is restricted to below 25%; similarly, we never
copy more than 15x of the input data even for a 5% memory
restriction. An example of how the boxes were chosen
for the TWITTER data set is shown in Fig. 8. Each figure
shows the front (x-y) plane of the 3-D input space. Darker
pixels stand for more data in the represented area. We see
that boxes become smaller around the more data-dense areas.
See Apx. D for more details.


Figure 7: Boxed LFTJ Analysis. On the x-axis, we vary the amount of total memory available for boxing shown in GB.
The first row shows total runtime in seconds without OS-level memory restrictions and with warm caches to evaluate the additional
CPU work necessary for boxing. For performance in an out-of-core scenario, we enforce OS-level memory restrictions and have
all caches cleared before execution in the second row. The third row shows the number of boxes and the amount of provisioned
memory in multiples of the size of the input data. Omitted graphs for {RAND|RMAT}16 look like the “80” variants.


Figure 8: Selected boxes for TWITTER dataset


Last, we are interested in how the boxed LFTJ compares
to a variant without our extension. Since LFTJ as presented
in [36] is a family of algorithms that needs to be parameterized
by how data is physically stored and how the TrieIterator
operations are implemented, answering this question
is hard since conclusions for one specific implementation of
the data back-end might not hold for another. In particular,
our approach of storing data in huge arrays and performing
mostly binary searches over them might be particularly bad
from an I/O perspective. However, with these considerations
in mind, we also ran our version of LFTJ with the
cgroup memory restrictions and a provisioning mode that
does not copy the data but leaves it in memory-mapped
files.® The data is thus paged in (from the input file) by
the Linux virtual memory system using a standard replacement
strategy. Results for this experiment are shown
in Fig. 9. The average speed ratios of vanilla over boxed for

Footnote: We also experimented with this so-called lazy provisioning
for boxed LFTJ: here, lazy and eager show about the same
performance; we omitted the data for space reasons.



[Plots omitted. Panels: (a) Limit: 25% of input; (b) Limit: 35% of input. Legend: vanilla, boxed.]


Figure 9: Vanilla vs. Boxed LFTJ for our LFTJ implementation
based on TrieArrays. Y-axis shows wall-clock
runtime in seconds. Memory restricted as mentioned.


the memory levels of 10%, 25%, and 35% are 65x, 30x, and
20x, respectively.

How does boxed LFTJ compare to specialized best-in-class
competitors for triangle listing? We compare
to (1) the triangle counting algorithm presented in
Schank’s dissertation [32], which has been implemented for the
graph analysis framework Graphlab. We chose this algorithm
as our in-memory competitor since it supports multiple
threads and was used in other comparisons before.
We also compare to (2) the MGT algorithm by Hu, Tao, and
Chung [10], which is (to the best of our knowledge) currently the
best triangle listing algorithm in the out-of-core setting. Our
results are shown in Fig. 10 and Fig. 11. The boxed LFTJ
is on average 65% slower than Graphlab, both when run in
single-threaded mode and in multi-threaded mode with
8 threads. Graphlab, being optimized for an in-memory setting
with optional distribution, was not able to run any of
our large data sets, getting “stuck” once all of the 32GB of
main memory and 32GB of swap space had been consumed.

Comparing to MGT (cf. Fig. 11): We used the cgroup


Footnote: We did not evaluate the optional distribution.
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Figure 10: Performance Graphlab vs. Boxed LFTJ 
for single- and multi-threaded configurations. No resource
limitations. Y-axis shows runtime in seconds.
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Figure 11: Comparison: Boxed LFTJ (1,8 threads) 
vs. MGT (1 thread) with limited memory. Y-Axis 
shows wall-clock runtime in seconds. 

memory restrictions and cleared caches for running MGT
and boxed LFTJ. When we run in single-threaded mode,
MGT outperforms boxed LFTJ by a factor of 3.1, 2.9,
and 2.9 in the configurations with 10%, 25%, and 35% of the
memory, respectively. Due to time constraints, we did not
run LFTJ in single-threaded mode for other configurations.
When we allow LFTJ to utilize all of the 4 available cores,
we are on average 47%, 22%, and 28% faster, respectively,
than the single-threaded MGT. We have not investigated
how well MGT parallelizes. Note that MGT internally uses
only 32 bits for node identifiers (vs. our 64-bit identifiers).
Nevertheless, we used the same values to configure and limit
the amount of memory for both MGT and LFTJ.


7. RELATED WORK 

Related work spans multiple areas at different levels of 
generality. From most broad to more specific: 

The SociaLite effort [34] at Stanford also proposes to
use systems based on relational joins (in this case Datalog)
for graph analysis. They show that declarative methods not
only allow for more succinct programs but are also competitive with,
if not faster than, typical other implementations. We
did not compare our join performance with the SociaLite
system as it is clearly more feature-rich; it is also Java-based,
which might or might not influence performance in ways orthogonal
to our investigation. We note that the benchmarks
presented there, which (among other queries) evaluate
counting triangles, did not use datasets as large as ours.

A worst-case optimal join algorithm was first presented
by Ngo et al. in [21], following the AGM bound,
which bounds the maximum number of tuples that can be
produced by a conjunctive join. Leapfrog Triejoin by Veldhuizen [36],
the join algorithm we are using, has been shown
to be worst-case optimal as well (modulo a log-factor). In
fact, Leapfrog Triejoin has been shown to be worst-case optimal
(modulo a log-factor) for more fine-grained families of inputs.

Our work, especially on the worst-case optimality for graphs
with limited arboricity, was inspired by these worst-case optimality
results. A good survey and description of this
class of worst-case optimal join algorithms also exists, in which the
authors describe not only the AGM bound and its application,
but also the original NPRR algorithm and LFTJ.

Most recently, Khamis, Ngo, Ré, and Rudra proposed so-called
beyond-worst-case-optimal join algorithms. Here, the
performed work is not measured against a worst case within
a set family of inputs, but instead must be proportional to
the size of a shortest proof of the result’s correctness. The
idea was proposed by Ngo, Nguyen, Ré, and Rudra in [20].
Furthermore, this line of work combines ideas from geometry and
resolution, transforming the algorithmic problem of computing
joins into a geometric one. Following this line of research is
very interesting as it might offer even better performance in
practice.

Our boxing approach is most closely related to the classic
block-nested loop join (BNLJ) [29]. An interesting avenue for
future work would be to investigate how optimizations and
results for the BNLJ transfer to the multi-predicate LFTJ.

Listing triangles in graphs is a well-researched area in computer
science. For the in-memory context, a recent survey
is available. Triangle listing can also be reduced to matrix
multiplication, and recent work proposes new algorithms
based on this approach. Chiba and Nishizeki propose
an in-memory triangle listing algorithm that runs in
$O(|E|\,\alpha(G))$, matching the best possible bound we give.
To the best of our knowledge, our insight that
this is the best possible theoretical bound for this class of
graphs is novel and thus provides new insights about these
algorithms. Earlier work already showed that enumerating
all triangles in planar graphs is a linear-time problem.

Triangle listing in the out-of-core context: Following up
on the MGT work [10], Pagh and Silvestri investigate the
I/O complexity of triangle listing [23]. They improve on
the I/O complexity of MGT from $O(|E|^2/(MB))$ to an
expected $O(|E|^{1.5}/(\sqrt{M}B))$. They also give lower bounds and
show that their algorithm is worst-case optimal by proving
that any algorithm that enumerates $K$ triangles needs to use
at least $\Omega(K/(\sqrt{M}B))$ I/Os. They also give a deterministic
algorithm using a color coding technique. Investigating
whether the techniques used could be generalized to general
joins is a very interesting avenue for future work. Prior to
[10], an algorithm with an I/O complexity of
$O(|E| + |E|^{1.5}/B)$ was proposed, as well as an algorithm
with an I/O complexity of $O(|E|^{1.5}/B \cdot \log_{M/B}(|E|/B))$.
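As a small worked comparison (a sketch that ignores constant factors), dividing the MGT bound $O(|E|^2/(MB))$ by the expected bound $O(|E|^{1.5}/(\sqrt{M}B))$ shows how the improvement scales with the ratio of input size to memory:

```latex
\frac{|E|^{2}/(MB)}{|E|^{1.5}/(\sqrt{M}\,B)}
  \;=\; \frac{|E|^{0.5}}{\sqrt{M}}
  \;=\; \sqrt{\frac{|E|}{M}}
```

Under this reading, the gap between the two bounds grows with the square root of how many times the edge set exceeds main memory; for example, an input 100 times larger than $M$ corresponds to a factor of 10.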

Cheng et al. study the general problem of finding maximal
cliques. We did not benchmark against these algorithms
since MGT dominated them by an order of magnitude.

Research has also been done on distributing triangle counting
and other graph algorithms, for example using
the MapReduce framework [27, 28, 35].


8. CONCLUSION 

For the well-studied problem of triangle listing, we have investigated
how a general-purpose, worst-case optimal join
algorithm compares against specialized approaches in the
out-of-core context. By using Leapfrog Triejoin, we were
able to devise a strategy that not only allows for good theoretical
bounds in terms of I/O and CPU costs but also
demonstrates very good performance: For very large input
graphs of 1.2 billion edges and more, LFTJ counts triangles
with a speed of 4 million input edges per second for uniformly
random data, and performs a complete count of the
35 billion triangles in the TWITTER dataset in little over 25 minutes
on a standard 4-core desktop machine while limiting the
available main memory to around 5GB. Our positive results
can be interpreted as a confirmation of the database community’s
theme of creating systems to empower (domain-expert)
users via declarative query interfaces while providing
very good performance.
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APPENDIX 

A. LEAPFROG JOIN AND 
LEAPFROG TRIEJOIN 


Algorithm 3 Leapfrog join of n unary relations 


in: Array of LinearIterators Iters[0..n-1]
variables: int i, bool atEnd
 1: procedure lfj-init()
 2:   if any iterator in Iters[0..n-1] is atEnd() then
 3:     atEnd ← true                     ▷ Some input is empty
 4:   else
 5:     sort Iters[0..n-1] increasingly by value()
 6:     i ← 0 ; atEnd ← false
 7:     lfj-search()
 8: procedure lfj-search()
 9:   while true do
10:     if Iters[(i − 1) mod n].atEnd() then
11:       atEnd ← true
12:       return                         ▷ No tuple can be found anymore
13:     max_value ← Iters[(i − 1) mod n].value()
14:     min_value ← Iters[i].value()
15:     if min_value = max_value then
16:       return                         ▷ Found tuple in intersection
17:     else
18:       Iters[i].seek(max_value) ; i ← (i + 1) mod n
19: procedure lfj-next()
20:   Iters[i].next() ; i ← (i + 1) mod n
21:   lfj-search()
22: procedure lfj-seek(val)
23:   Iters[i].seek(val) ; i ← (i + 1) mod n
24:   lfj-search()
25: function lfj-value(): return Iters[0].value()
26: function lfj-atEnd(): return atEnd
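The leapfrog join of Algorithm 3 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation evaluated in this paper: the class `LinearIterator` and the generator `leapfrog_join` are hypothetical names, inputs are assumed to be sorted lists without duplicates, and the init/search/next procedures are folded into a single loop.

```python
from bisect import bisect_left


class LinearIterator:
    """Iterator over a sorted, duplicate-free list (hypothetical helper)."""

    def __init__(self, keys):
        self.keys = keys
        self.pos = 0

    def at_end(self):
        return self.pos >= len(self.keys)

    def value(self):
        return self.keys[self.pos]

    def next(self):
        self.pos += 1

    def seek(self, v):
        # Forward-position to the smallest key >= v via binary search.
        self.pos = bisect_left(self.keys, v, self.pos)


def leapfrog_join(relations):
    """Yield, in increasing order, the keys contained in every input list."""
    iters = [LinearIterator(r) for r in relations]
    if any(it.at_end() for it in iters):
        return  # some input is empty
    iters.sort(key=lambda it: it.value())
    n = len(iters)
    i = 0
    while True:
        # Invariant: iters[i] holds the smallest key, iters[i - 1] the largest.
        max_value = iters[(i - 1) % n].value()
        if iters[i].value() == max_value:
            yield max_value          # all iterators agree on this key
            iters[i].next()
        else:
            iters[i].seek(max_value)
        if iters[i].at_end():
            return                   # no further key can be found
        i = (i + 1) % n


if __name__ == "__main__":
    a = [0, 1, 3, 4, 5, 6, 7, 8, 9, 11]
    b = [0, 2, 6, 7, 8, 9]
    c = [2, 4, 5, 8, 10]
    print(list(leapfrog_join([a, b, c])))
```

Each iterator is only ever moved forward, and the iterator holding the smallest key always seeks to the current largest key, which is what lets the join skip over long runs of non-matching values.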
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Algorithm 4 Leapfrog Triejoin with n variables 

in: Array of LeapfrogJoins Lfs[1..n]
variables: int d                         ▷ Current depth
 1: procedure lftj-init(): d ← 0
 2: procedure lftj-open()
 3:   d ← d + 1
 4:   for all iter used in Lfs[d] do: iter.open()
 5:   Lfs[d].lfj-init()
 6: procedure lftj-close()
 7:   for all iter used in Lfs[d] do: iter.close()
 8:   d ← d − 1
 9: procedure lftj-next(): Lfs[d].lfj-next()
10: procedure lftj-seek(v): Lfs[d].lfj-seek(v)
11: function lftj-value(): return Lfs[d].lfj-value()
12: function lftj-atEnd(): return Lfs[d].lfj-atEnd()
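For the triangle query, the three levels that Algorithm 4 walks through (one leapfrog join per variable x, y, z) can be unrolled into nested sorted-list intersections. The sketch below is a simplified stand-in under stated assumptions: `intersect` and `list_triangles` are hypothetical names, the graph is held in in-memory adjacency lists rather than behind TrieIterators, and edges are directed from the smaller to the larger node id so that each triangle is reported once.

```python
from bisect import bisect_left


def intersect(a, b):
    """Leapfrog-style intersection of two sorted, duplicate-free lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = bisect_left(a, b[j], i + 1)   # seek a forward to b[j]
        else:
            j = bisect_left(b, a[i], j + 1)   # seek b forward to a[i]
    return out


def list_triangles(edges):
    """List triangles (x, y, z) with x < y < z from undirected edges."""
    succ = {}
    for u, v in edges:
        if u != v:
            lo, hi = min(u, v), max(u, v)
            succ.setdefault(lo, set()).add(hi)
    succ = {x: sorted(ns) for x, ns in succ.items()}
    v1 = sorted(succ)                       # nodes with an outgoing edge
    result = []
    for x in v1:                                    # level x
        for y in intersect(succ[x], v1):            # level y
            for z in intersect(succ[x], succ[y]):   # level z
                result.append((x, y, z))
    return result


if __name__ == "__main__":
    k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(list_triangles(k4))
```

The generic LFTJ reaches the same nesting through open() and close() calls on the output trie; the unrolled form only makes the per-level intersections explicit.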


A.1 TrieIterator Example

A TrieIterator is initialized to the root node r. Methods
for vertical navigation are: open() for moving “down” to
the first child of the current node and close() for moving
“up” to the parent of the current node. Horizontally,
movement is restricted to direct siblings, which are accessed
via the LinearIterator interface that comprises the methods
atEnd, next, seek, and value. It is convenient to think
of the l children of a node n as stored increasingly sorted
in an array A of size l. The method bool atEnd() returns
true if the iterator is positioned after the last element
(e.g., at position l). The method next() requests to move
to the next element; atEnd will be true if the iterator was
at the last position already (e.g., calling next() at position
l − 1). The method seek(T v) can be used to forward-position
the iterator to the element with value v; if v is not
in A, then the iterator is placed at the element with the
smallest value w > v, or atEnd if no such w exists. Finally,
data is accessed at the granularity of a single domain element
via the method T value(), which returns the element at
the current position. The methods open, next, seek, and
value may only be called if atEnd() is false; furthermore,
the value v given to seek(v) must be at least value(); and
value() must not be called at the root node r.

Example 20 (TrieIterator Navigation) See Figure 1(b).
The iterator is initially positioned at r; open() moves it to a,
followed by next() to b. Here, open() moves to u; next()
to v; and a seek(w) will position the iterator at z since z is
the smallest among u, v, z which is larger than w. A call to
next() causes atEnd() to return true, after which close()
would be the only allowed operation, moving the iterator
back to b.

A.2 Leapfrog Triejoin Procedure

Consider a join described as a Datalog rule body with m
atoms and n variables. For each of the m atoms, a single
TrieIterator is created. Furthermore, LFTJ maintains
an array of n Leapfrog joins, one join for each variable.
The LFJ for variable x_i uses pointers to the TrieIterators
for atoms that mention the variable x_i. Overall, LFTJ is
implemented as a TrieIterator itself (see Algorithm 4). A
variable d remembers at which level of the output trie the
iterator is positioned. The vertical navigation methods
manipulate d, open and close the appropriate TrieIterators,
and initialize the Leapfrog joins. The linear iterator methods
are then simply delegated to the LFJ which computes
the appropriate intersections.

Footnote: The actual join results are collected by walking the trie.
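The TrieIterator interface described in this appendix can be sketched over a sorted list of tuples as follows. This is an illustrative assumption, not the TrieArray back-end used in our experiments: the class layout, the `_first` helper, and the example tuples are hypothetical, and the preconditions stated in the text (for example, not calling value() at the root) are not checked.

```python
class TrieIterator:
    """Trie view over sorted, duplicate-free tuples (illustrative sketch only)."""

    def __init__(self, tuples):
        self.t = sorted(set(tuples))
        self.pos = [0]              # pos[d]: index of the current tuple at depth d
        self.end = [len(self.t)]    # end[d]: end of the sibling range at depth d

    @property
    def depth(self):
        return len(self.pos) - 1    # 0 means the root node r

    def _first(self, v, lo, hi, strict):
        # Binary search: first index in [lo, hi) whose key at the current
        # depth is >= v (or > v when strict is True).
        col = self.depth - 1
        while lo < hi:
            mid = (lo + hi) // 2
            key = self.t[mid][col]
            if key < v or (strict and key == v):
                lo = mid + 1
            else:
                hi = mid
        return lo

    def at_end(self):
        return self.pos[-1] >= self.end[-1]

    def value(self):
        return self.t[self.pos[-1]][self.depth - 1]

    def next(self):
        self.pos[-1] = self._first(self.value(), self.pos[-1], self.end[-1], True)

    def seek(self, v):
        self.pos[-1] = self._first(v, self.pos[-1], self.end[-1], False)

    def open(self):
        # The children of the current node are the tuples sharing its key prefix.
        if self.depth == 0:
            lo, hi = 0, len(self.t)
        else:
            lo = self.pos[-1]
            hi = self._first(self.value(), lo, self.end[-1], True)
        self.pos.append(lo)
        self.end.append(hi)

    def close(self):
        self.pos.pop()
        self.end.pop()


if __name__ == "__main__":
    it = TrieIterator([("a", "x"), ("b", "u"), ("b", "v"), ("b", "z")])
    it.open(); it.next(); it.open(); it.next(); it.seek("w")
    print(it.value())
```

The usage lines replay the navigation of Example 20 on made-up tuples (the child "x" of node a is invented, since Figure 1(b) is not reproduced here): after seek("w") the iterator rests on z, one more next() makes at_end() true, and close() returns to b.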
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(a) Graph G (b) LFTJ-A steps when running on G 


Figure 12: Example input graph that causes
LFTJ-A to use many I/Os. Parameters: M = 20,
B = 4, and graph with N = 24 edges.


B.1 Proof for Prop.

Consult Fig. 12(b). The variable assignments for x := a,
y := b, and z := c as well as the corresponding neighbors $D(a)$
and $D(b)$ are shown. Each node in $V_1$ causes two block I/Os.
Further, the block storing the node with id 24 of $V_1$ will be
evicted when x = 5 and y = 4 or earlier, and the last block
with 24 is thus repeatedly loaded when x = 6, x = 12, and
x = 18.

Detailed proof sketch for the general case: The outer loop in
line 1 of the LFTJ-A algorithm ranges from 0 to N. For each value
x := a, we then join a’s neighbors with $V_1$ (line 2) to obtain
bindings for y. Since each node a has exactly one neighbor
b, this essentially performs a lookup of b in the first column
of E. Now, since we spaced the second values in E
a distance of B apart, locating each b within E incurs at
least one I/O. Also, since the second values in E repeat in
groups of size T, the blocks needed for the second group
will have been evicted from memory before they are needed,
resulting in a single I/O for each tuple in E. The last step
is to intersect the neighbors of a with the neighbors of b. In
our TrieArray representation, this will incur another I/O.
□

B.2 Proof for Proposition [I^ 

The bound on the runtime can easily be obtained from the
worst-case optimality wrt. input sizes of LFTJ (Corollary 4.3
in [36]) and the fractional edge-cover bound. For any
three binary relations $R$, $S$, $T$ the result size $|Q|$ of the join
$Q(x,y,z) \leftarrow R(x,y), S(y,z), T(x,z)$ is limited according to
the fractional edge cover. If the sizes of $R$, $S$, and $T$ agree,
then $|Q|$ is at most $n^{1.5}$ with $n = |R| = |S| = |T|$; adding the
log-factor, we obtain the desired bound of $O(|E|^{1.5} \log |E|)$.

The complexity is optimal modulo the log-factor since a
graph with $|E|$ edges can have $\Omega(|E|^{1.5})$ triangles. □
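The lower-bound side of this argument can be checked numerically on complete graphs, which are a standard family with on the order of $|E|^{1.5}$ triangles. The helper name `complete_graph_counts` is hypothetical, and the sketch only illustrates the growth rate; it is not part of the proof.

```python
from math import comb, sqrt


def complete_graph_counts(k):
    """Return (number of edges, number of triangles) of the complete graph K_k."""
    return comb(k, 2), comb(k, 3)


if __name__ == "__main__":
    for k in (10, 100, 1000):
        edges, triangles = complete_graph_counts(k)
        ratio = triangles / edges ** 1.5
        # The ratio approaches sqrt(2)/3 from below as k grows.
        print(k, edges, triangles, round(ratio, 4), round(sqrt(2) / 3, 4))
```

Since the ratio of triangles to $|E|^{1.5}$ tends to the constant $\sqrt{2}/3$, the number of triangles in $K_k$ is $\Theta(|E|^{1.5})$, so any listing algorithm needs that many steps on this family.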


Footnote: When storing relations in B-Trees or as an array of
lexicographically sorted tuples, the single neighbor of b might
already be available once b has been loaded. However, even the
reduced I/O cost of at least $|E(G_N)|$ demonstrates thrashing.


B.3 Proof for Theorem UTl 


Let $\alpha$ be as required. We now analyze the work done by
LFTJ-A on a graph $G$ with its directed version $G^* = (V, E)$
(possibly obtained via an $O(|E| \log |E|)$ preprocessing). As usual, let
$V_1$ be all nodes in $E$ that have an outgoing edge. It
is useful to also consult the corresponding figure for an explanation of which
Leapfrog joins are executed during LFTJ-A. We now count
the steps at each variable:

• At level x: We leapfrog-join $V_1$ with itself, yielding a
bound of $O(|E| \log |E|)$ based on the requirements for
the TrieIterator operations (see Section 2.1).

• At level y: For each $x \in V_1$, a leapfrog-join is performed
between $D(x)$ and $V_1$. As usual, $D(x)$ are the followers
of $x$, i.e., $D(x) = \{y \mid (x,y) \in E\}$. Summing up all cost
and using that the runtime of a leapfrog-join between
two relations of size $s_1$ and $s_2$, respectively, is bounded
by $O(\log(\max\{s_1, s_2\}) \cdot \min\{s_1, s_2\})$, we obtain:
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• At level z: Here, for (at most) each edge [x, y) in E we 
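leapfrog-join the neighbors of $x$ with the neighbors of
$y$. We thus incur the work:

\begin{align*}
 & O\Big( \sum_{(x,y) \in E} \log(\max\{d_x, d_y\}) \cdot \min\{d_x, d_y\} \Big) \\
\subseteq{} & O\Big( \sum_{(x,y) \in E} \log\big(\max_{v \in V}\{d_v\}\big) \cdot \min\{d_x, d_y\} \Big) \\
\subseteq{} & O\Big( \log\big(\max_{v \in V}\{d_v\}\big) \cdot \sum_{(x,y) \in E} \min\{d_x, d_y\} \Big) \\
\subseteq{} & O\Big( \log|V| \cdot \sum_{(x,y) \in E} \min\{d_x, d_y\} \Big)
\end{align*}

As their Lemma 2, Chiba and Nishizeki observed that for
any graph $G = (V, E)$, the sum $\sum_{(x,y) \in E} \min\{d_x, d_y\}$
is bounded by $2\alpha(G)|E|$. Since $\alpha(G) \le \alpha(|E|)$ and
because $\alpha$ is monotonically increasing, we can bound
the work by $O(\log |E| \cdot \alpha(|E|) \cdot |E|)$, finishing the proof.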
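The Chiba and Nishizeki bound used in the last step can be sanity-checked on graphs whose arboricity is known in closed form. The sketch below is only an illustration: the helper names are hypothetical, degrees are taken in the undirected graph, and the arboricity values used (⌈k/2⌉ for the complete graph $K_k$, 1 for a star) are standard facts rather than something the code computes.

```python
from itertools import combinations
from math import ceil


def min_degree_sum(edges):
    """Sum of min(d_x, d_y) over all edges (x, y) of an undirected graph."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return sum(min(deg[u], deg[v]) for u, v in edges)


def complete_graph(k):
    return list(combinations(range(k), 2))


def star(n):
    return [(0, leaf) for leaf in range(1, n + 1)]


if __name__ == "__main__":
    k6 = complete_graph(6)
    # K_6: arboricity ceil(6/2) = 3, so the bound is 2 * 3 * 15 = 90.
    print(min_degree_sum(k6), 2 * ceil(6 / 2) * len(k6))
    s10 = star(10)
    # Star with 10 leaves: arboricity 1, so the bound is 2 * 1 * 10 = 20.
    print(min_degree_sum(s10), 2 * 1 * len(s10))
```

In both cases the sum stays below $2\alpha(G)|E|$; the star shows why taking the minimum matters, since summing the maximum degree instead would give a quadratic total.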

B.4 Proof for Theorem 19

We first show: 


Lemma 21 Let $\alpha : \mathbb{N} \to \mathbb{N}^+$ be an arbitrary monotonically
increasing, computable function. For any $m \in \mathbb{N}$ there exists
a graph with $m$ edges and arboricity at most $\alpha(m)$ with at
least $\tfrac{2}{3}m\alpha(m) - \tfrac{2}{3}m - \tfrac{4}{3}\alpha(m)^3 - \tfrac{2}{3}\alpha(m)^2$ triangles.

Proof. Informal overview of technique. To get many
triangles, we use fully connected graphs $K_k$; to stay under
the arboricity limit, we choose $k$ appropriately; to get many
edges, we union many of these into the graph and
then fill up with singleton edges. The math works out to
the above quantity.

Formal proof. Let $\alpha$ be as required. Fix an $m \in \mathbb{N}$. Let
$k = 2\alpha(m)$. Note that the fully connected graph $K_k$ with
$k$ nodes has $l = k(k-1)/2$ edges. We construct a graph
$G$ by packing as many $K_k$ as we can fit into our “m-edges
budget” and filling up the rest with unconnected edges: Let
$n = \lfloor m/l \rfloor$, and let $G$ be the graph composed of $n$ instances of
$K_k$ and $m - nl$ single edges not connected to anything else.
To complete the proof, we show: (1) The arboricity of $G$ is
$\alpha(m)$, and (2) $G$ has at least $\tfrac{2}{3}m\alpha(m) - \tfrac{2}{3}m - \tfrac{4}{3}\alpha(m)^3 - \tfrac{2}{3}\alpha(m)^2$ triangles.
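The construction in the formal proof is easy to replay for concrete numbers. The following sketch builds the graph for a given edge budget m and a given value a standing in for $\alpha(m)$, counts its triangles by brute force, and compares the count against the bound of Lemma 21. All function names are hypothetical, and passing a directly as an integer (instead of a function $\alpha$) is a simplification.

```python
from itertools import combinations


def lemma21_graph(m, a):
    """Pack copies of K_k (k = 2a) into an m-edge budget, pad with single edges."""
    k = 2 * a
    l = k * (k - 1) // 2          # edges per K_k
    n = m // l                    # number of K_k copies that fit
    edges, next_node = [], 0
    for _ in range(n):
        nodes = range(next_node, next_node + k)
        edges.extend(combinations(nodes, 2))
        next_node += k
    for _ in range(m - n * l):    # isolated filler edges
        edges.append((next_node, next_node + 1))
        next_node += 2
    return edges


def count_triangles(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Every triangle is seen once from each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


def lemma21_bound(m, a):
    return (2 * m * a - 2 * m - 4 * a ** 3 - 2 * a ** 2) / 3


if __name__ == "__main__":
    m, a = 100, 3
    g = lemma21_graph(m, a)
    print(len(g), count_triangles(g), lemma21_bound(m, a))
```

For m = 100 and a = 3, six copies of $K_6$ fit (90 edges) plus 10 filler edges, giving 120 triangles against a bound of roughly 91.3, which is consistent with the lemma.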

Showing (1). The classic Nash-Williams result states
that for any graph $G$, its arboricity $\alpha(G)$ is characterized by
the maximum edge-node ratio among all its subgraphs:

$$\alpha(G) = \max_{S \text{ is subgraph of } G} \left\lceil \frac{|E(S)|}{|V(S)| - 1} \right\rceil$$

It can easily be verified that choosing a $K_k$ as subgraph
maximizes the ratio. Thus, $\alpha(G) = \alpha(K_k) = \lceil k/2 \rceil = \alpha(m)$.

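Showing (2). As shorthand, let $a = \alpha(m)$, and let $m' = nl$,
which is the largest integer multiple of $l$ that is not larger
than $m$. Each $K_k$ has $\binom{k}{3} = k(k-1)(k-2)/6 = l(k-2)/3$
triangles, and we have $n$ of them, totaling

\begin{align*}
nl(k-2)/3 &= \tfrac{1}{3}\, m'(k-2) \\
 &= \tfrac{2}{3}\, m'(a-1) && \triangleright\ k = 2a \\
 &\ge \tfrac{2}{3}\,(m - l + 1)(a-1) && \triangleright\ m' \ge m - l + 1 \\
 &= \tfrac{2}{3}\,(ma - m - la + l + a - 1) \\
 &\ge \tfrac{2}{3}\,(ma - m - l(a+1) + a - 1) && \triangleright\ l \ge 0 \\
 &= \tfrac{2}{3}\,(ma - m - (2a^2 - a)(a+1) + a - 1) && \triangleright\ l = 2a^2 - a \\
 &= \tfrac{2}{3}\,(ma - m - (2a^3 + a^2 - a) + a - 1) \\
 &= \tfrac{2}{3}\,(ma - m - 2a^3 - a^2 + a + a - 1) \\
 &= \tfrac{2}{3}\,(ma - m - 2a^3 - a^2 + 2a - 1) \\
 &= \tfrac{2}{3}\,(ma - m - 2a^3 - (a-1)^2) \\
 &\ge \tfrac{2}{3}\,(ma - m - 2a^3 - a^2) && \triangleright\ a \ge 1
\end{align*}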


$= \tfrac{2}{3}ma - \tfrac{2}{3}m - \tfrac{4}{3}a^3 - \tfrac{2}{3}a^2$

triangles as required. □

We prove Theorem 19 indirectly. Let $\alpha : \mathbb{N} \to \mathbb{N}^+$ be an
arbitrary, monotonically increasing, computable function, not
identical to 1, that is in $o(\sqrt{n})$. Further, let $A$ be an algorithm
that lists all triangles in graphs $G = (V, E)$ with
$\alpha(G) \le \alpha(|E|)$ in $o(|E|\,\alpha(|E|))$ time. Let $T_A(m) : \mathbb{N} \to \mathbb{N}$
be the maximal number of steps $A$ performs on any graph
$G = (V, E)$ with $|E| \le m$ and $\alpha(G) \le \alpha(|E|)$.

Since $A$ runs in $o(|E|\,\alpha(|E|))$ time: choose $c_0 = 1/16$ and
let $m_0$ be such that we have:

$$T_A(m) < \tfrac{1}{16}\, m\, \alpha(m) \quad \text{for all } m \ge m_0 \qquad (1)$$

From $\alpha \in o(\sqrt{n})$: choose $\varepsilon_1 = 1/\sqrt{8}$ and let $m_1$ be such that
for all $m \ge m_1$ we have $\alpha(m) < \varepsilon_1 \sqrt{m}$.

Now, let $m^* \in \mathbb{N}$ be a large enough number such that (1)
$m^* \ge 8$, (2) $m^* \ge m_0$, (3) $m^* \ge m_1$, and (4) $\alpha(m^*) \ge 2$.
We can satisfy all conditions since $\alpha$ maps into $\mathbb{N}^+$, is
monotonically increasing, and is not identical to 1. We apply
Lemma 21 with our $\alpha$ for $m^*$, and conclude there is a graph
$G^*$ with $m^*$ edges and arboricity at most $\alpha(m^*)$ with at least
$s(m^*) = \tfrac{2}{3}m^*\alpha(m^*) - \tfrac{2}{3}m^* - \tfrac{4}{3}\alpha(m^*)^3 - \tfrac{2}{3}\alpha(m^*)^2$ triangles.

Clearly, $A$ needs to take at least $s(m^*)$ steps on $G^*$. Thus,
$T_A(m^*) \ge s(m^*)$, which (using $m^* \ge 8$ and $\alpha(m^*) \ge 2$)
contradicts (1). □


C. SYSTEMS ASPECTS

C.1 Caches and Limiting Resident Memory

To clear Linux file caches we used as root:

    sync && echo 3 > /proc/sys/vm/drop_caches

We restricted the memory that a process uses for any reason
(data, heap, program, caches, etc.) using Linux cgroups.
Checking afterwards via top confirms that only the allowed
resident memory is used by the process. As root, we used
commands such as:

    # create a group
    mkdir -p /sys/fs/cgroup/memory/limit_mem
    # add process to group
    echo $PID_OF_PROCESS \
        > /sys/fs/cgroup/memory/limit_mem/tasks
    # limit memory
    echo $LIMIT_BYTES \
        > /sys/fs/cgroup/memory/limit_mem/memory.limit_in_bytes


D. MORE DETAILS FOR FIG. 8 

The input space for LFTJ-A is 3-dimensional. We box
for atoms[1] = {E(x,y), E(x,z)} and atoms[2] = {E(y,z)}.
Since there were no spills, intervals for dimension 2 are always
[−∞, ∞]. The figures show how these boxes are created
by projecting the 3-D input space onto the x-y plane. Darker
pixels indicate areas where there is more data. In particular,
the image was created as follows: For E(x, y) of the directed
graph for the TWITTER dataset, which can be viewed as a point
set in 2D space, we create a 2D histogram H with 150x150 bins.
Then, because we slice along the first dimension and collect
the nodes plus their neighbors, we aggregate over H’s second
dimension (i.e., the y values) to obtain a 1D histogram D
showing the total number of neighbors the nodes in a certain
bin have. We then spread this 1D histogram into a 2D space
by setting the value at position x,y to D(x) + D(y). This
“image” is indicative of the total amount of data for a rectangular
box. As a last step, we equalize the histogram and
map it to greyscale to obtain a prettier picture. The red boxes
are then drawn on top according to the provisioning
decisions made during the boxing procedure. In the picture, the x-axis
goes from bottom-left to bottom-right and the y-axis from
bottom-left to top-left, the same way as in Fig. 2(e). Note
that the number of columns corresponds to how often we
need to load the input data at level y.
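The histogram steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used to produce Fig. 8: the function name `density_image` is hypothetical, node ids are assumed to lie in the range 0 to num_nodes − 1, bins are of equal width, and the final histogram equalization, greyscale mapping, and box drawing are left out.

```python
def density_image(edges, num_nodes, bins=150):
    """Return a bins x bins grid holding D(x) + D(y) at cell (x, y)."""
    # 2D histogram H of the point set E(x, y).
    H = [[0] * bins for _ in range(bins)]
    for x, y in edges:
        bx = x * bins // num_nodes
        by = y * bins // num_nodes
        H[bx][by] += 1
    # Aggregate over the second dimension: D[b] is the total number of
    # neighbors of the nodes that fall into bin b.
    D = [sum(row) for row in H]
    # Spread the 1D histogram into 2D space.
    return [[D[bx] + D[by] for by in range(bins)] for bx in range(bins)]


if __name__ == "__main__":
    toy_edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
    for row in density_image(toy_edges, num_nodes=4, bins=2):
        print(row)
```

On the toy input, the first bin (nodes 0 and 1) carries three outgoing edges and the second bin one, so the cell for the first bin against itself is the darkest; this mirrors how data-dense regions show up in the figure.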







